A class of estimators of the Rényi and Tsallis entropies of an unknown
distribution $f$ in $\mathbb{R}^m$ is presented. These estimators are based
on the $k$th nearest-neighbor distances computed from a sample of $N$
i.i.d. vectors with distribution $f$. We show that entropies of any order
$q$, including Shannon's entropy, can be estimated consistently with
minimal assumptions on $f$. Moreover, we show that it is straightforward
to extend the nearest-neighbor method to estimate the statistical
distance between two distributions using one i.i.d. sample from each.

1. Introduction. We consider the problem of estimating the Rényi [33]
entropy

(1.1)  $H_q^* = \frac{1}{1-q} \log \int_{\mathbb{R}^m} f^q(x)\,dx, \quad q \neq 1,$

or the Havrda and Charvát [15] entropy (also called Tsallis [37] entropy)

(1.2)  $H_q = \frac{1}{q-1} \Big(1 - \int_{\mathbb{R}^m} f^q(x)\,dx\Big), \quad q \neq 1,$

of a random vector $X \in \mathbb{R}^m$ with probability measure $\mu$ which has
density $f$ with respect to the Lebesgue measure, from $N$ independent and
identically distributed (i.i.d.) samples $X_1, \ldots, X_N$, $N \geq 2$. Note
that $H_q^*$ can be expressed as a function of $H_q$. Indeed,
$H_q^* = \log[1 - (q-1)H_q]/(1-q)$, and for any $q$,
$d(H_q^*)/d(H_q) > 0$ and $[d^2(H_q^*)/d(H_q)^2]/(q-1) > 0$. For $q < 1$ and
$q > 1$, $H_q^*$ is thus a strictly increasing concave and convex function
of $H_q$, respectively, and the maximization of $H_q^*$ and that of $H_q$
are equivalent. Hence, in what follows we shall speak indifferently of
$q$-entropy maximizing distributions. When $q$ tends to 1, both $H_q$ and
$H_q^*$ tend to the (Boltzmann-Gibbs-) Shannon entropy

(1.3)  $H_1 = -\int_{\mathbb{R}^m} f(x) \log f(x)\,dx.$

We consider a new class of estimators of $H_q$ and $H_q^*$ based on the
approach proposed by Kozachenko and Leonenko [21], who consider the
estimation of $H_1$; see also [11]. Within the classification made in [3],
which also contains an outstanding overview of nonparametric Shannon
entropy estimation, the method falls in the category of nearest-neighbor
distances. See also [13]. When $m = 1$, the nearest-neighbor method is
related to sample-spacing methods; see, for example, [41] for an early
reference concerning Shannon entropy. It also has some connections with
the more recent random-graph approach of Redmond and Yukich [32], who, on
the supposition that the distribution is supported on $[0,1]^m$ together
with some smoothness assumptions on $f$, construct a strongly consistent
estimator of $H_q^*$ for $0 < q < 1$ (up to an unknown bias term
independent of $f$ and related to the graph properties). For $q \neq 1$,
our construction relies on the estimation of the integral

(1.4)  $I_q = E\{f^{q-1}(X)\} = \int_{\mathbb{R}^m} f^q(x)\,dx$

through the computation of conditional moments of nearest-neighbor
distances. It thus possesses some similarities with that of Evans, Jones
and Schmidt [8], who establish the weak consistency of an estimator of
$I_q$ for $m \geq 2$ and $q < 1$ under the conditions that $f$ is continuous
and strictly positive on a compact convex subset $C$ of $\mathbb{R}^m$, with
bounded partial derivatives on $C$. In comparison to Redmond and Yukich
[32] and Evans, Jones and Schmidt [8], our results cover a larger range
of values for $q$ and do not rely on assumptions of regularity or bounded
support for $f$. For the sake of completeness, we also consider the case
$q = 1$, that is, the estimation of Shannon entropy, with results obtained
as corollaries of those for $q \neq 1$ (at the expense of requiring
slightly stronger conditions than Kozachenko and Leonenko [21]).

The entropy (1.2) is of interest in the study of nonlinear Fokker-Planck
equations, with $q < 1$ for the case of subdiffusion and $q > 1$ for
superdiffusion; see [38]. Values of $q \in [1, 3]$ are used by Alemany and
Zanette [1] to study the behavior of fractal random walks. Applications
for quantizer design, characterization of time-frequency distributions,
image registration and indexing, texture classification and image
matching, etc., are indicated by Hero et al. [16], Hero and Michel [17]
and Neemuchwala, Hero and Carson [29]. Entropy minimization is used by
Pronzato, Thierry and Wolsztynski [31] and Wolsztynski, Thierry and
Pronzato [45] for parameter estimation in semi-parametric models. Entropy
estimation is also a basic tool for independent component analysis in
signal processing; see, for example, [22, 23].

The entropy $H_q$ is a concave function of the density for $q > 0$ (and
convex for $q < 0$). Hence, $q$-entropy maximizing distributions, under
some specific constraints, are uniquely defined for $q > 0$. For instance,
the $q$-entropy maximizing distribution is uniform under the constraint
that the distribution is finitely supported. More interestingly, for any
dimension $m \geq 1$, the $q$-entropy maximizing distribution with a given
covariance matrix is of the multidimensional Student-t type if
$m/(m+2) < q < 1$; see [43]. This generalizes the well-known property
that Shannon entropy $H_1$ is maximized for the normal distribution. Such
entropy-maximization properties can be used to derive nonparametric
statistical tests by following the same approach as Vasicek [41], who
tests normality with $H_1$; see also [11].

The layout of the paper is as follows. Section 2 develops some of the moti- 
vations and applications just mentioned (see also Section 3.3 for signal and 
image processing applications). The main results of the paper are presented 
in Section 3. The paper is focused on entropy estimation, but in Section 3.3 
we show how a slight modification of the method also allows us to estimate 
statistical distances and divergences between two distributions. Section 4 
gives some examples and Section 5 indicates some related results and pos- 
sible developments. The proofs of the results of Section 3 are collected in 
Section 6. 

2. Properties, motivation and applications. 

2.1. Nonlinear Fokker-Planck equation and entropy. Consider a family
of time-dependent p.d.f.'s $f_t$. The p.d.f. that maximizes Rényi entropy
(1.1) [and Tsallis entropy (1.2)] subject to the constraints
$\int_{\mathbb{R}} f_t(x)\,dx = 1$,
$\int_{\mathbb{R}} [x - \bar{x}(t)] f_t(x)\,dx = 0$,
$\int_{\mathbb{R}} [x - \bar{x}(t)]^2 f_t(x)\,dx = \sigma_q^2(t)$, for fixed
$q > 1$, is the solution of a nonlinear Fokker-Planck (or Kolmogorov)
equation; see [38].

Let $X$ and $Y$ be two independent random vectors respectively in
$\mathbb{R}^{m_1}$ and $\mathbb{R}^{m_2}$. Define $Z = (X, Y)$ and let $f(x,y)$
denote the joint density for $Z$. Let $f_1(x)$ and $f_2(y)$ be the marginal
densities for $X$ and $Y$ respectively, so that $f(x,y) = f_1(x) f_2(y)$. It
is well known that the Shannon and Rényi entropies (1.3) and (1.1)
satisfy the additive property $H_q^*(f) = H_q^*(f_1) + H_q^*(f_2)$,
$q \in \mathbb{R}$, while for the Tsallis entropy (1.2), one has
$H_q(f) = H_q(f_1) + H_q(f_2) + (1-q) H_q(f_1) H_q(f_2)$. The first
property is known in the physical literature as the extensivity property
of Shannon and Rényi entropies, while the second is known as
nonextensivity (with $q$ the parameter of nonextensivity). The paper by
Frank and Daffertshofer [10] presents a survey of results related to
entropies in connection with nonlinear Fokker-Planck equations and normal
or anomalous diffusion processes. In particular, the so-called Sharma and
Mittal entropy $H_{q,s} = [1 - (I_q)^{(s-1)/(q-1)}]/(s-1)$, with
$q, s > 0$, $q, s \neq 1$ and $I_q$ given by (1.4), represents a possible
unification of the (nonextensive) Tsallis entropy (1.2) and (extensive)
Rényi entropy (1.1). It satisfies $\lim_{s \to 1} H_{q,s} = H_q^*$,
$\lim_{s,q \to 1} H_{q,s} = H_1$, $H_{q,q} = H_q$ and
$\lim_{q \to 1} H_{q,s} = \{1 - \exp[-(s-1)H_1]\}/(s-1) = H_s^G$, $s > 0$,
$s \neq 1$, where $H_s^G$ is known as Gaussian entropy. Notice that a
consistent estimator of $H_{q,s}$ can be obtained from the estimator of
$I_q$ presented in Section 3.
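
The identities of this subsection are easy to check numerically once $I_q$ is known. The following is a minimal sketch, not part of the original development: the function names are illustrative, and the uniform density on an interval (for which $I_q = \text{width}^{1-q}$) is chosen only because its $I_q$ is available in closed form.

```python
import math


def renyi_from_I(I_q, q):
    """Renyi entropy (1.1) written as log(I_q) / (1 - q), q != 1."""
    return math.log(I_q) / (1.0 - q)


def tsallis_from_I(I_q, q):
    """Tsallis entropy (1.2) written as (1 - I_q) / (q - 1), q != 1."""
    return (1.0 - I_q) / (q - 1.0)


def sharma_mittal_from_I(I_q, q, s):
    """Sharma-Mittal entropy H_{q,s} = [1 - I_q^((s-1)/(q-1))] / (s - 1)."""
    return (1.0 - I_q ** ((s - 1.0) / (q - 1.0))) / (s - 1.0)


def I_q_uniform(width, q):
    """I_q for the uniform density on an interval of the given width."""
    return width ** (1.0 - q)


q = 1.7
I1 = I_q_uniform(2.0, q)
I2 = I_q_uniform(5.0, q)
I12 = I1 * I2  # I_q factorizes for independent components

# Extensivity of the Renyi entropy: H*(f) - [H*(f1) + H*(f2)] should vanish.
renyi_gap = renyi_from_I(I12, q) - (renyi_from_I(I1, q) + renyi_from_I(I2, q))

# Nonextensivity of the Tsallis entropy: the cross term (1 - q) H(f1) H(f2).
h1, h2 = tsallis_from_I(I1, q), tsallis_from_I(I2, q)
tsallis_gap = tsallis_from_I(I12, q) - (h1 + h2 + (1.0 - q) * h1 * h2)

# Special cases of the Sharma-Mittal entropy.
sm_diag = sharma_mittal_from_I(I1, q, q)            # equals Tsallis H_q
sm_s_to_1 = sharma_mittal_from_I(I1, q, 1.0 + 1e-6)  # close to Renyi H_q^*
```

Because every quantity is a function of $I_q$ alone, any consistent estimate of $I_q$ can be substituted for the exact value used here.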

2.2. Entropy maximizing distributions. The $m$-dimensional random
vector $X = ([X]_1, \ldots, [X]_m)^\top$ is said to have a multidimensional
Student distribution $T(\nu, \Sigma, \mu)$ with mean $\mu \in \mathbb{R}^m$,
scaling or correlation matrix $\Sigma$, covariance matrix
$C = \nu\Sigma/(\nu - 2)$ and $\nu$ degrees of freedom if its p.d.f. is

(2.1)  $f_\nu(x) = \frac{1}{(\nu\pi)^{m/2}} \frac{\Gamma((m+\nu)/2)}{\Gamma(\nu/2)} \frac{1}{|\Sigma|^{1/2} [1 + (x-\mu)^\top [\nu\Sigma]^{-1} (x-\mu)]^{(m+\nu)/2}},$

$x \in \mathbb{R}^m$. The characteristic function of the distribution
$T(\nu, \Sigma, \mu)$ is

$\phi(\zeta) = E\exp(i\langle\zeta, X\rangle) = \exp(i\langle\zeta, \mu\rangle)\, K_{\nu/2}\big(\sqrt{\nu\zeta^\top\Sigma\zeta}\big) \big(\sqrt{\nu\zeta^\top\Sigma\zeta}\big)^{\nu/2} \frac{2^{1-\nu/2}}{\Gamma(\nu/2)},$

$\zeta \in \mathbb{R}^m$, where $K_{\nu/2}$ denotes the modified Bessel function
of the second kind. If $\nu = 1$, then (2.1) is the $m$-variate Cauchy
distribution. If $(\nu + m)/2$ is an integer, then (2.1) is the $m$-variate
Pearson type VII distribution. If $Y$ is $\mathcal{N}(0, \Sigma)$ and if
$\nu S^2$ is independent of $Y$ and $\chi^2$-distributed with $\nu$ degrees
of freedom, then $X = Y/S + \mu$ has the p.d.f. (2.1). The limiting form
of (2.1) as $\nu \to \infty$ is the $m$-variate normal distribution
$\mathcal{N}(\mu, \Sigma)$. The Rényi entropy (1.1) of (2.1) is

$H_q^* = \frac{1}{1-q} \log \frac{B(q(m+\nu)/2 - m/2,\, m/2)}{B^q(\nu/2,\, m/2)} + \frac{1}{2} \log[(\pi\nu)^m |\Sigma|] - \log\Gamma\Big(\frac{m}{2}\Big), \quad q > \frac{m}{m+\nu}.$

It converges as $\nu \to \infty$ to the Rényi entropy

(2.2)  $H_q^*(\mu, \Sigma) = \log[(2\pi)^{m/2} |\Sigma|^{1/2}] - \frac{m}{2(1-q)} \log q = H_1(\mu, \Sigma) - \frac{m}{2}\Big(1 + \frac{\log q}{1-q}\Big)$

of the multidimensional normal distribution $\mathcal{N}(\mu, \Sigma)$. When
$q \to 1$, $H_q^*(\mu, \Sigma)$ tends to
$H_1(\mu, \Sigma) = \log[(2\pi e)^{m/2} |\Sigma|^{1/2}]$, the Shannon entropy
of $\mathcal{N}(\mu, \Sigma)$.
For $m/(m+2) < q < 1$, the $q$-entropy maximizing distribution under the
constraint

(2.3)  $E(X - \mu)(X - \mu)^\top = C$

is the Student distribution $T(\nu, (\nu-2)C/\nu, 0)$ with
$\nu = 2/(1-q) - m > 2$; see [43]. For $q > 1$, we define
$p = m + 2/(q-1)$ and the $q$-entropy maximizing distribution under the
constraint (2.3) has then finite support given by
$\Omega_q = \{x \in \mathbb{R}^m : (x-\mu)^\top [(p+2)C]^{-1} (x-\mu) \leq 1\}$.
Its p.d.f. is

(2.4)  $f_p(x) = \frac{\Gamma(p/2+1)}{|C|^{1/2} [\pi(p+2)]^{m/2} \Gamma((p-m)/2+1)} \big[1 - (x-\mu)^\top [(p+2)C]^{-1} (x-\mu)\big]^{1/(q-1)}$ if $x \in \Omega_q$, and $f_p(x) = 0$ otherwise.

The characteristic function of the p.d.f. (2.4) is given by

$\phi(\zeta) = \exp(i\langle\zeta, \mu\rangle)\, 2^{p/2}\, \Gamma\Big(\frac{p}{2} + 1\Big) |\zeta^\top (p+2) C \zeta|^{-p/2} J_{p/2}(|\zeta^\top (p+2) C \zeta|),$

$\zeta \in \mathbb{R}^m$, where $J_{p/2}$ denotes the Bessel function of the
first kind.

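Alternatively, $f_\nu$ for $q < 1$ or $f_p$ for $q > 1$ also maximizes the
Shannon entropy (1.3) under a logarithmic constraint; see [20, 46].
Indeed, when $q < 1$, $f_\nu(x)$ given by (2.1) with $\nu = 2/(1-q) - m$ and
$\Sigma = (\nu - 2)C/\nu$ maximizes $H_1$ under the constraint

$\int_{\mathbb{R}^m} \log(1 + x^\top [(\nu-2)C]^{-1} x) f(x)\,dx = \Psi\Big(\frac{\nu+m}{2}\Big) - \Psi\Big(\frac{\nu}{2}\Big)$

and when $q > 1$, $f_p(x)$ given by (2.4) with $p = 2/(q-1) + m$ maximizes
$H_1$ under

$\int_{\Omega_q} \log(1 - x^\top [(p+2)C]^{-1} x) f(x)\,dx = \Psi\Big(\frac{p-m}{2} + 1\Big) - \Psi\Big(\frac{p}{2} + 1\Big),$

where $\Psi(z) = \Gamma'(z)/\Gamma(z)$ is the digamma function.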
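
The closed-form Rényi entropies of this subsection can be evaluated with log-gamma functions only. The sketch below is illustrative (the function names and the `log_det_sigma` parameter are assumptions introduced here); it evaluates the Student entropy and the normal entropy (2.2) and can be used to observe the convergence of the former to the latter as $\nu$ grows.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def renyi_normal(q, m, log_det_sigma=0.0):
    """Renyi entropy (2.2) of the m-variate normal N(mu, Sigma), q != 1."""
    return (0.5 * m * math.log(2.0 * math.pi) + 0.5 * log_det_sigma
            - m * math.log(q) / (2.0 * (1.0 - q)))


def renyi_student(q, m, nu, log_det_sigma=0.0):
    """Renyi entropy of the m-variate Student T(nu, Sigma, mu), q != 1."""
    if q <= m / (m + nu):
        raise ValueError("the entropy is only defined for q > m / (m + nu)")
    log_ratio = (log_beta(q * (m + nu) / 2.0 - m / 2.0, m / 2.0)
                 - q * log_beta(nu / 2.0, m / 2.0))
    return (log_ratio / (1.0 - q)
            + 0.5 * (m * math.log(math.pi * nu) + log_det_sigma)
            - math.lgamma(m / 2.0))


# One-dimensional Cauchy (nu = 1), q = 2: the integral of f^2 is 1/(2 pi).
cauchy_h2 = renyi_student(2.0, 1, 1.0)

# Large nu: the Student value approaches the normal value (2.2).
student_large_nu = renyi_student(2.0, 3, 1.0e4)
normal_value = renyi_normal(2.0, 3)

# q close to 1: (2.2) approaches the Shannon entropy (m/2) log(2 pi e).
normal_near_shannon = renyi_normal(1.0 - 1e-6, 3)
```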

2.3. Information spectrum. Considered as a function of $q$, $H_q^*$ (1.1)
is known as the spectrum of Rényi information; see [36]. The value of
$H_q^*$ for $q = 2$ corresponds to the negative logarithm of the well-known
efficacy parameter $E f(X)$ that arises in asymptotic efficiency
considerations. Consider now

(2.5)  $\dot{H}_1 = \lim_{q \to 1} \frac{dH_q^*}{dq}.$

It satisfies

$\dot{H}_1 = \lim_{q \to 1} \Big\{ \frac{\log \int_{\mathbb{R}^m} f^q(x)\,dx}{(1-q)^2} + \frac{\int_{\mathbb{R}^m} f^q(x) \log f(x)\,dx}{(1-q) \int_{\mathbb{R}^m} f^q(x)\,dx} \Big\}$
$\quad = -\frac{1}{2} \Big\{ \int_{\mathbb{R}^m} f(x) [\log f(x)]^2\,dx - \Big[\int_{\mathbb{R}^m} f(x) \log f(x)\,dx\Big]^2 \Big\}$
$\quad = -\frac{1}{2} \mathrm{var}[\log f(X)].$


The quantity $S(f) = -2\dot{H}_1 = \mathrm{var}[\log f(X)]$ gives a measure of
the intrinsic shape of the density $f$; it is a location and scale
invariant positive functional [$S(f) = S(g)$ when
$f(x) = \sigma^{-1} g[(x-\mu)/\sigma]$]. For the multivariate normal
distribution $\mathcal{N}(\mu, \Sigma)$, $H_q^*$ is given by (2.2) and
$S(f) = m/2$. For the one-dimensional Student distribution with $\nu$
degrees of freedom (for which $EX^{\nu-1}$ exists, but not $EX^\nu$), with
density

$f_\nu(x) = \frac{1}{(\nu\pi)^{1/2}} \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)} \frac{1}{(1 + x^2/\nu)^{(\nu+1)/2}},$

we obtain

$H_q^* = \frac{1}{1-q} \log \frac{B(q(\nu+1)/2 - 1/2,\, 1/2)}{B^q(\nu/2,\, 1/2)} + \frac{1}{2} \log\nu, \quad q > \frac{1}{1+\nu},$

and

(2.6)  $S(f_\nu) =$
         $\pi^2/3 \approx 3.2899$, for $\nu = 1$ (Cauchy distribution),
         $9 - \frac{3}{4}\pi^2 \approx 1.5978$, for $\nu = 2$,
         $\frac{4}{3}\pi^2 - 12 \approx 1.1595$, for $\nu = 3$,
         $\frac{775}{36} - \frac{25}{12}\pi^2 \approx 0.9661$, for $\nu = 4$,
         $3\pi^2 - \frac{115}{4} \approx 0.8588$, for $\nu = 5$,

and, more generally,
$S(f_\nu) = (1/4)(\nu+1)^2 \{\dot\Psi(\nu/2) - \dot\Psi[(\nu+1)/2]\}$, with
$\dot\Psi(x)$ the trigamma function, $\dot\Psi(x) = d^2 \log\Gamma(x)/dx^2$.
The information provided by $S(f)$ on the shape of the distribution
complements that given by other more classical characteristics like
kurtosis. [Note that the kurtosis is not defined for $f_\nu$ when
$\nu \leq 4$; the one-dimensional Student distribution $f_6$ and the
bi-exponential Laplace distribution $f_L$ have the same kurtosis but
different values of $S(f)$ since
$S(f_6) = 147931/3600 - (49/12)\pi^2 \approx 0.7911$ and $S(f_L) = 1$.] For
the multivariate Student distribution (2.1), we get
$S(f_\nu) = (1/4)(\nu+m)^2 \{\dot\Psi(\nu/2) - \dot\Psi[(\nu+m)/2]\}$. The
$q$-entropy maximizing property of the Student distribution can be used to
test that the observed samples are Student distributed, and the
estimation of $S(f)$ then provides information about $\nu$. This finds
important applications, for instance, in financial mathematics; see [18].
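
The tabulated values of $S(f_\nu)$ can be reproduced from the trigamma formula. The sketch below is a standard-library illustration only: the `trigamma` helper (upward recurrence followed by the usual asymptotic series) and the function name `student_shape` are choices made here, not part of the original text.

```python
import math


def trigamma(x):
    """Trigamma function d^2 log Gamma(x) / dx^2 for x > 0.

    Uses trigamma(x) = trigamma(x + 1) + 1/x^2 to shift the argument
    above 8, then a truncated asymptotic expansion.
    """
    total = 0.0
    while x < 8.0:
        total += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7) - 1/(30x^9)
    series = inv + 0.5 * inv2 + inv * inv2 * (
        1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 * (1.0 / 42.0 - inv2 / 30.0)))
    return total + series


def student_shape(nu, m=1):
    """S(f_nu) = (1/4)(nu + m)^2 [trigamma(nu/2) - trigamma((nu + m)/2)]."""
    return 0.25 * (nu + m) ** 2 * (trigamma(nu / 2.0) - trigamma((nu + m) / 2.0))


shape_values = {nu: student_shape(nu) for nu in range(1, 7)}
```

For large $\nu$ the value approaches $m/2$, the shape functional of the normal distribution, which gives a quick sanity check on the helper.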
3. Main results. Let $\rho(x,y)$ denote the Euclidean distance between two
points $x, y$ of $\mathbb{R}^m$ (see Section 5 for an extension to other
metrics). For a given sample $X_1, \ldots, X_N$, and a given $X_i$ in the
sample, from the $N - 1$ distances $\rho(X_i, X_j)$, $j = 1, \ldots, N$,
$j \neq i$, we form the order statistics
$\rho^{(i)}_{1,N-1} \leq \rho^{(i)}_{2,N-1} \leq \cdots \leq \rho^{(i)}_{N-1,N-1}$.
Therefore, $\rho^{(i)}_{1,N-1}$ is the nearest-neighbor distance from the
observation $X_i$ to some other $X_j$ in the sample, $j \neq i$, and
similarly, $\rho^{(i)}_{k,N-1}$ is the $k$th nearest-neighbor distance from
$X_i$ to some other $X_j$.

3.1. Rényi and Tsallis entropies. We shall estimate $I_q$, $q \neq 1$, by

(3.1)  $\hat{I}_{N,k,q} = \frac{1}{N} \sum_{i=1}^N (\zeta_{N,i,k})^{1-q},$

with

(3.2)  $\zeta_{N,i,k} = (N-1)\, C_k V_m (\rho^{(i)}_{k,N-1})^m,$

where $V_m = \pi^{m/2}/\Gamma(m/2 + 1)$ is the volume of the unit ball
$B(0,1)$ in $\mathbb{R}^m$ and

$C_k = \Big[\frac{\Gamma(k)}{\Gamma(k+1-q)}\Big]^{1/(1-q)}.$

Note that $I_1 = 1$ since $f$ is a p.d.f. and that $I_q$ is finite when
$q < 0$ only if $f$ is of bounded support. Indeed,
$I_q = \int_{\{x: f(x) \geq 1\}} f^q(x)\,dx + \int_{\{x: f(x) < 1\}} f^q(x)\,dx > \int_{\{x: f(x) < 1\}} f^q(x)\,dx > \mu_L\{x : f(x) < 1\}$,
with $\mu_L$ the Lebesgue measure. Also, when $f$ is bounded, $I_q$ tends to
the (Lebesgue) measure of its support $\mu_L\{x : f(x) > 0\}$ when
$q \to 0^+$. Some other properties of $I_q$ are summarized in Lemma 1 of
Section 6.



Remark 3.1. When $f$ is known, a Monte Carlo estimator of $I_q$ based
on the sample $X_1, \ldots, X_N$ is

(3.3)  $\frac{1}{N} \sum_{i=1}^N f^{q-1}(X_i).$

The nearest-neighbor estimator $\hat{I}_{N,k,q}$ given by (3.1) could thus
also be considered as a plug-in estimator,
$\hat{I}_{N,k,q} = (1/N) \sum_{i=1}^N [\hat{f}_{N,k}(X_i)]^{q-1}$, where
$\hat{f}_{N,k}(x) = 1/\{(N-1)\, C_k V_m [\rho_{k+1,N}(x)]^m\}$ with
$\rho_{k+1,N}(x)$ the $(k+1)$th nearest-neighbor distance from $x$ to the
sample. One may notice the resemblance between $\hat{f}_{N,k}(x)$ and the
density function estimator
$\tilde{f}_{N,k}(x) = k/\{N V_m [\rho_{k+1,N}(x)]^m\}$
suggested by Loftsgaarden and Quesenberry [26]; see also [7, 28].
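
A direct transcription of (3.1) and (3.2) is short. The sketch below is a minimal brute-force illustration, assuming a sample stored as a list of equal-length tuples with no duplicated points; the function names are hypothetical, the neighbor search is the naive $O(N^2)$ one, and the uniform test density on $[0,1]$ is chosen here only because $I_q = 1$ for every $q$.

```python
import heapq
import math
import random


def unit_ball_volume(m):
    """Volume V_m of the unit ball in R^m."""
    return math.pi ** (m / 2.0) / math.gamma(m / 2.0 + 1.0)


def knn_distances(sample, k):
    """k-th nearest-neighbor distance from each point to the other points."""
    out = []
    for i, x in enumerate(sample):
        dists = (math.dist(x, y) for j, y in enumerate(sample) if j != i)
        out.append(heapq.nsmallest(k, dists)[-1])
    return out


def estimate_I_q(sample, k, q):
    """Nearest-neighbor estimate (3.1)-(3.2) of I_q, for q != 1 and q < k + 1."""
    if q == 1 or q >= k + 1:
        raise ValueError("requires q != 1 and q < k + 1")
    n, m = len(sample), len(sample[0])
    # log C_k = [log Gamma(k) - log Gamma(k + 1 - q)] / (1 - q)
    log_c_k = (math.lgamma(k) - math.lgamma(k + 1.0 - q)) / (1.0 - q)
    log_scale = math.log(n - 1) + log_c_k + math.log(unit_ball_volume(m))
    total = 0.0
    for rho in knn_distances(sample, k):
        log_zeta = log_scale + m * math.log(rho)  # log of zeta_{N,i,k}
        total += math.exp((1.0 - q) * log_zeta)
    return total / n


random.seed(0)
uniform_sample = [(random.random(),) for _ in range(500)]
i2_hat = estimate_I_q(uniform_sample, k=5, q=2.0)  # true I_2 = 1
renyi_hat = math.log(i2_hat) / (1.0 - 2.0)         # true H_2^* = 0
```

Working with logarithms of the distances avoids overflow when $m$ is large; for large $N$ a tree-based neighbor search would replace the brute-force loop.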
We suppose that $X_1, \ldots, X_N$, $N \geq 2$, are i.i.d. with a probability
measure $\mu$ having a density $f$ with respect to the Lebesgue measure.
[However, if $\mu$ has a finite number of singular components superimposed
on the absolutely continuous component $f$, one can remove all zero
distances from the $\rho^{(i)}_{k,N-1}$ in the computation of the estimate
(3.1), which then enjoys the same properties as in Theorems 3.1 and 3.2,
that is, yields a consistent estimator of the Rényi and Tsallis entropies
of the continuous component $f$.] The main results of the paper are as
follows.

Theorem 3.1 (Asymptotic unbiasedness). The estimator $\hat{I}_{N,k,q}$
given by (3.1) satisfies

(3.4)  $E\hat{I}_{N,k,q} \to I_q, \quad N \to \infty,$

for $q < 1$, provided that $I_q$ given by (1.4) exists, and for any
$q \in (1, k+1)$ if $f$ is bounded.

Under the conditions of Theorem 3.1,
$E(1 - \hat{I}_{N,k,q})/(q-1) \to H_q$ as $N \to \infty$, which provides an
asymptotically unbiased estimate of the Tsallis entropy of $f$.

Theorem 3.2 (Consistency). The estimator $\hat{I}_{N,k,q}$ given by (3.1)
satisfies

(3.5)  $\hat{I}_{N,k,q} \xrightarrow{L_2} I_q, \quad N \to \infty,$

(and thus, $\hat{I}_{N,k,q} \xrightarrow{p} I_q$, $N \to \infty$) for $q < 1$,
provided that $I_{2q-1}$ exists, and for any $q \in (1, (k+1)/2)$ when
$k \geq 2$ [resp. $q \in (1, 3/2)$ when $k = 1$] if $f$ is bounded.



Corollary 3.1. Under the conditions of Theorem 3.2,

(3.6)  $\hat{H}_{N,k,q} = (1 - \hat{I}_{N,k,q})/(q-1) \xrightarrow{L_2} H_q$

and

(3.7)  $\hat{H}^*_{N,k,q} = \log(\hat{I}_{N,k,q})/(1-q) \xrightarrow{p} H_q^*$

as $N \to \infty$, which provides consistent estimates of the Rényi and
Tsallis entropies of $f$.

We show the following in the proof of Theorem 3.2: when $q < 1$ and
$I_{2q-1} < \infty$, or $1 < q < (k+2)/2$ and $f$ is bounded,

$E(\zeta_{N,i,k}^{1-q} - I_q)^2 \to \Delta_{k,q} = I_{2q-1} \frac{\Gamma(k+2-2q)\,\Gamma(k)}{\Gamma^2(k+1-q)} - I_q^2, \quad N \to \infty.$

Notice that
$\lim_{k \to \infty} \Delta_{k,q} = I_{2q-1} - I_q^2 = \mathrm{var}[f^{q-1}(X)] = N\,\mathrm{var}[(1/N) \sum_{i=1}^N f^{q-1}(X_i)]$,
that is, the limit of $\Delta_{k,q}$ for $k \to \infty$ equals $N$ times the
variance of the Monte Carlo estimator (3.3) (which forms a lower bound on
the variance of an estimator of $I_q$ based on the sample
$X_1, \ldots, X_N$).
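
The limit $\Delta_{k,q}$ and its Monte Carlo floor can be tabulated directly from the gamma-function ratio. The following sketch uses hypothetical helper names; the closed form $I_q = (2\pi)^{(1-q)/2} q^{-1/2}$ for the standard normal density is the one implied by (2.2) with $m = 1$ and $\Sigma = 1$.

```python
import math


def delta_kq(k, q, I_q, I_2q_minus_1):
    """Limit Delta_{k,q} of E(zeta^{1-q} - I_q)^2; requires k + 2 - 2q > 0."""
    if k + 2.0 - 2.0 * q <= 0:
        raise ValueError("requires q < (k + 2) / 2")
    log_ratio = (math.lgamma(k + 2.0 - 2.0 * q) + math.lgamma(k)
                 - 2.0 * math.lgamma(k + 1.0 - q))
    return I_2q_minus_1 * math.exp(log_ratio) - I_q ** 2


def I_q_std_normal(q):
    """I_q for the standard normal density on the real line."""
    return (2.0 * math.pi) ** ((1.0 - q) / 2.0) / math.sqrt(q)


q = 2.0
I_q, I_2q1 = I_q_std_normal(q), I_q_std_normal(2.0 * q - 1.0)
monte_carlo_floor = I_2q1 - I_q ** 2  # var[f^{q-1}(X)]
deltas = {k: delta_kq(k, q, I_q, I_2q1) for k in (3, 5, 10, 1000)}

# Uniform density on [0, 1]: I_q = 1 for all q, so only the gamma ratio remains.
delta_uniform = delta_kq(5, 2.0, 1.0, 1.0)
```

In this example $\Delta_{k,q}$ decreases with $k$ toward the floor, which is the behavior described in the text for $k \to \infty$.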

Under the assumption that $f$ is three times continuously differentiable
$\mu_L$-almost everywhere, we can improve Lemma 2 of Section 6 into

$\frac{1}{V_m R^m} \int_{B(x,R)} f(z)\,dz = f(x) + \frac{R^2}{2(m+2)} \sum_{i=1}^m \frac{\partial^2 f(x)}{\partial x_i^2} + o(R^2), \quad R \to 0,$

which can be used to approximate $F_{N,x,k}(u) - F_{x,k}(u)$ in the proof of
Theorem 3.1. We thereby obtain an approximation of the bias
$\hat{B}_{N,k,q} = E\hat{I}_{N,k,q} - I_q = E\zeta_{N,i,k}^{1-q} - I_q$, which,
after some calculations, can be written as

$\hat{B}_{N,k,q} =$
    $(q-1)\dfrac{(2-q) I_q}{2N} + O(1/N^2)$, for $m = 1$,
    $\dfrac{q-1}{N} \Big[\dfrac{(k+1-q) J_{q-2}}{8\pi} + \dfrac{(2-q) I_q}{2}\Big] + O(1/N^{3/2})$, for $m = 2$,
    $\dfrac{q-1}{N^{2/m}} \dfrac{\Gamma(k+1+2/m-q)}{D_m \Gamma(k+1-q)} J_{q-1-2/m} + O(1/N^{3/m})$, for $m \geq 3$,

where
$J_\beta = \int_{\mathbb{R}^m} f^\beta(x) \big(\sum_{i=1}^m \partial^2 f(x)/\partial x_i^2\big)\,dx$
and $D_m = 2(m+2) V_m^{2/m}$. For instance, for $f$ the density of the
normal $\mathcal{N}(0, \sigma^2 I_m)$, we get

$J_\beta = -\frac{m}{\sigma^2} \frac{1}{(2\pi\sigma^2)^{m\beta/2}} \frac{\beta}{(\beta+1)^{1+m/2}},$

which is defined for $\beta > -1$. From the expression of the MSE for
$\hat{I}_{N,k,q}$ given in (6.8), we obtain

(3.8)  $E(\hat{I}_{N,k,q} - I_q)^2 = \frac{\Delta_{k,q}}{N} - 2 I_q \hat{B}_{N,k,q} (1 + o(1)) + [E(\zeta_{N,i,k}^{1-q} \zeta_{N,j,k}^{1-q}) - I_q^2].$

Investigating the behavior of the last term requires an asymptotic
approximation for $F_{N,x,y,k}(u,v) - F_{x,k}(u) F_{y,k}(v)$ (see the proof of
Theorem 3.2), which is under current investigation. Preliminary results
for $k = 1$ show that the contribution of this term to the MSE for
$\hat{I}_{N,k,q}$ cannot be ignored in general.
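
The $m = 1$ leading term $(q-1)(2-q)I_q/(2N)$ can be compared with an exact finite-$N$ factor. The sketch below rests on an assumption made here and not stated in the text: when the density is locally constant (no curvature and no boundary), the probability mass of the $k$th nearest-neighbor ball is Beta$(k, N-k)$ distributed, so that $E\zeta_{N,i,k}^{1-q}/I_q$ reduces to $(N-1)^{1-q}\,\Gamma(N)/\Gamma(N+1-q)$, independently of $k$. The function names are illustrative.

```python
import math


def exact_relative_bias(n, q):
    """(N-1)^(1-q) Gamma(N) / Gamma(N+1-q) - 1, under the stated assumption."""
    log_factor = ((1.0 - q) * math.log(n - 1)
                  + math.lgamma(n) - math.lgamma(n + 1.0 - q))
    return math.expm1(log_factor)


def leading_relative_bias(n, q):
    """Leading term of the m = 1 bias, divided by I_q."""
    return (q - 1.0) * (2.0 - q) / (2.0 * n)


n, q = 1000, 1.5
exact = exact_relative_bias(n, q)
leading = leading_relative_bias(n, q)
gap = exact - leading  # expected to be of order 1/N^2
```

The difference between the two values is of order $1/N^2$, consistent with the remainder term quoted for $m = 1$.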

3.2. Shannon entropy. For the estimation of $H_1$ ($q = 1$), we take the
limit of $\hat{H}_{N,k,q}$ as $q \to 1$, which gives

(3.9)  $\hat{H}_{N,k,1} = \frac{1}{N} \sum_{i=1}^N \log \xi_{N,i,k},$

with

(3.10)  $\xi_{N,i,k} = (N-1) \exp[-\Psi(k)]\, V_m (\rho^{(i)}_{k,N-1})^m,$

where $\Psi(z) = \Gamma'(z)/\Gamma(z)$ is the digamma function
[$\Psi(1) = -\gamma$ with $\gamma \approx 0.5772$ the Euler constant and, for
$k \geq 1$ integer, $\Psi(k) = -\gamma + A_{k-1}$ with $A_0 = 0$ and
$A_j = \sum_{i=1}^j 1/i$]; see [22, 42] for applications of this estimator
in physical sciences. We then have the following.

Corollary 3.2. Suppose that $f$ is bounded and that $I_{q_1}$ exists for
some $q_1 < 1$. Then $H_1$ exists and the estimator (3.9) satisfies
$\hat{H}_{N,k,1} \xrightarrow{L_2} H_1$ as $N \to \infty$.


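Remark 3.2. One may notice that $\hat{H}^*_{N,k,q}$ given by (3.7) is a
smooth function of $q$. Its derivative at $q = 1$ can be used as an
estimate of $\dot{H}_1$ defined by (2.5). Straightforward calculations give

$\lim_{q \to 1} \frac{d\hat{H}^*_{N,k,q}}{dq} = \frac{\dot\Psi(k)}{2} - \frac{1}{2} \Big\{ \frac{1}{N} \sum_{i=1}^N \log^2 \xi_{N,i,k} - \Big[\frac{1}{N} \sum_{i=1}^N \log \xi_{N,i,k}\Big]^2 \Big\}$

and $S(f) = -2\dot{H}_1$ can be estimated by

(3.11)  $\hat{S}_{N,k} = \frac{1}{N} \sum_{i=1}^N \log^2 \xi_{N,i,k} - \Big[\frac{1}{N} \sum_{i=1}^N \log \xi_{N,i,k}\Big]^2 - \dot\Psi(k).$

We obtain the following in the proof of Corollary 3.2:

$E(\log \xi_{N,i,k} - H_1)^2 \to \mathrm{var}[\log f(X)] + \dot\Psi(k), \quad N \to \infty,$

with $\dot\Psi(z) = d^2 \log\Gamma(z)/dz^2$ [and, for $k$ integer,
$\dot\Psi(k) = \sum_{j=k}^\infty 1/j^2$]. Note that $\mathrm{var}[\log f(X)]$
forms a lower bound on the variance of a Monte Carlo estimation of $H_1$
based on $\log f(X_i)$, $i = 1, \ldots, N$, and that $\dot\Psi(k) \to 0$ as
$k \to \infty$.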
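
The Shannon estimator (3.9) and the shape estimator (3.11) share the same quantities $\log \xi_{N,i,k}$, so both can be computed in one pass. The sketch below is a brute-force, standard-library illustration with hypothetical function names; it assumes integer $k$, a sample of distinct points stored as tuples, and uses the series $\dot\Psi(k) = \pi^2/6 - \sum_{j<k} 1/j^2$.

```python
import heapq
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(k):
    """Psi(k) = -gamma + sum_{i=1}^{k-1} 1/i for integer k >= 1."""
    return -EULER_GAMMA + sum(1.0 / i for i in range(1, k))


def trigamma_int(k):
    """Trigamma at integer k >= 1: sum_{j >= k} 1/j^2."""
    return math.pi ** 2 / 6.0 - sum(1.0 / (j * j) for j in range(1, k))


def log_xi(sample, k):
    """log xi_{N,i,k} from (3.10) for every point of the sample."""
    n, m = len(sample), len(sample[0])
    v_m = math.pi ** (m / 2.0) / math.gamma(m / 2.0 + 1.0)
    const = math.log(n - 1) - digamma_int(k) + math.log(v_m)
    values = []
    for i, x in enumerate(sample):
        dists = (math.dist(x, y) for j, y in enumerate(sample) if j != i)
        rho = heapq.nsmallest(k, dists)[-1]  # k-th nearest-neighbor distance
        values.append(const + m * math.log(rho))
    return values


def shannon_and_shape(sample, k):
    """Return (H_hat from (3.9), S_hat from (3.11))."""
    values = log_xi(sample, k)
    mean = sum(values) / len(values)
    mean_sq = sum(v * v for v in values) / len(values)
    return mean, mean_sq - mean * mean - trigamma_int(k)


random.seed(1)
normal_sample = [(random.gauss(0.0, 1.0),) for _ in range(600)]
h1_hat, s_hat = shannon_and_shape(normal_sample, k=4)
# For N(0, 1): H_1 = 0.5 * log(2 pi e) and S(f) = 1/2.
```

The shape estimate is noticeably noisier than the entropy estimate at this sample size, so it should be read with a wide tolerance.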

Similarly to Remark 3.1, the estimator $\hat{H}_{N,k,1}$ given by (3.9)
could be considered as a plug-in estimator,
$\hat{H}_{N,k,1} = -(1/N) \sum_{i=1}^N \log[\hat{f}'_{N,k}(X_i)]$ with
$\hat{f}'_{N,k}(x) = \exp[\Psi(k)]/\{(N-1) V_m [\rho_{k+1,N}(x)]^m\}$. One may
notice that selecting $k$ by likelihood cross-validation based on the
density function estimator suggested by Loftsgaarden and Quesenberry
[26], $\tilde{f}_{N,k}(x) = k/\{N V_m [\rho_{k+1,N}(x)]^m\}$, amounts to
maximizing $-\hat{H}_{N,k,1} + \log k - \Psi(k)$, with
$\log k - \Psi(k) = 1/(2k) + 1/(12k^2) + O(1/k^4)$, $k \to \infty$. In our
simulations this method always tended to select $k = 1$; replacing
$\tilde{f}_{N,k}(x)$ by $\hat{f}'_{N,k}(x)$, or by $\hat{f}_{N,k}(x)$ of Remark
3.1, does not seem to yield a valid selection procedure for $k$ either.

Let $\tilde{H}_{N,k,1}$ be the plug-in estimator of $H_1$ based on
$\tilde{f}_{N,k}$ defined by
$\tilde{H}_{N,k,1} = -(1/N) \sum_{i=1}^N \log[\tilde{f}_{N,k}(X_i)]$. Then,
under the conditions of Corollary 3.2, we obtain that
$\lim_{N \to \infty} E\tilde{H}_{N,k,1} = H_1 + \Psi(k) - \log k$ (since
$\tilde{H}_{N,k,1} = \hat{H}_{N,k,1} + \Psi(k) - \log k + \log[N/(N-1)]$).
Under the additional assumption on $f$ that it belongs to the class
$\mathcal{F}$ of uniformly continuous p.d.f. satisfying
$0 < c_1 \leq f(x) \leq c_2 < \infty$ for some constants $c_1, c_2$, we obtain
the uniform and almost sure convergence of $\tilde{H}_{N,k,1}$ to $H_1(f)$
over the class $\mathcal{F}$ provided that $k = k_N \to \infty$,
$k_N/N \to 0$ and $k_N/\log N \to \infty$ as $N \to \infty$; see the results
of Devroye and Wagner [7] on the strong uniform consistency of
$\tilde{f}_{N,k}$. Notice that the choice of $k$ proposed by Hall, Park and
Samworth [14] for nearest-neighbor classification does not satisfy these
conditions.

3.3. Relative entropy and divergences. In some situations the statistical
distance between distributions can be estimated through the computation
of entropies, so that the method of $k$th nearest-neighbor distances
presented above can be applied straightforwardly. For instance, the
$q$-Jensen difference

$J_q^\beta(f,g) = H_q^*[\beta f + (1-\beta) g] - [\beta H_q^*(f) + (1-\beta) H_q^*(g)], \quad 0 \leq \beta \leq 1,$

(see, e.g., [2]) can be estimated if we have three samples, respectively
distributed according to $f$, $g$ and $\beta f + (1-\beta) g$. Suppose that
we have one sample $S_i$ ($i = 1, \ldots, s$) of i.i.d. variables generated
from $f$ and one sample $T_j$ ($j = 1, \ldots, t$) of i.i.d. variables
generated from $g$, with $s$ and $t$ increasing at a constant rate as a
function of $N = s + t$. Then, $H_q^*(f)$ and $H_q^*(g)$ can be estimated
consistently from the two samples when $N \to \infty$; see Corollary 3.1.
Also, as $N \to \infty$, the estimator $\hat{H}^*_{N,k,q}$ based on the sample
$X_1, \ldots, X_N$ with $X_i = S_i$ ($i = 1, \ldots, s$) and $X_i = T_{i-s}$
($i = s+1, \ldots, N$) converges to $H_q^*[\beta f + (1-\beta) g]$, with
$\beta = s/N$, so that $J_q^\beta(f,g)$ can be estimated consistently from
the two samples. This situation is encountered, for instance, in the
image matching problem presented in [29], where entropy is estimated
through the random graph approach of Redmond and Yukich [32]. As shown
below, some other types of distances or divergences, which are not
expressed directly through entropies, can also be estimated by the
nearest-neighbor method.

Let $K(f,g)$ denote the Kullback-Leibler relative entropy,

(3.12)  $K(f,g) = \int_{\mathbb{R}^m} f(x) \log\frac{f(x)}{g(x)}\,dx = \check{H}_1 - H_1,$

where $H_1$ is given by (1.3) and
$\check{H}_1 = -\int_{\mathbb{R}^m} f(x) \log g(x)\,dx$. Given $N$ independent
observations $X_1, \ldots, X_N$ distributed with the density $f$ and $M$
observations $Y_1, \ldots, Y_M$ distributed with $g$, we wish to estimate
$K(f,g)$. The second term $H_1$ can be estimated by (3.9), with asymptotic
properties given by Corollary 3.2. The first term $\check{H}_1$ can be
estimated in a similar manner, as follows: given $X_i$ in the sample,
$i \in \{1, \ldots, N\}$, consider $\rho(X_i, Y_j)$, $j = 1, \ldots, M$, and
the order statistics
$\rho^{(i)}_{1,M} \leq \rho^{(i)}_{2,M} \leq \cdots \leq \rho^{(i)}_{M,M}$, so that
$\rho^{(i)}_{k,M}$ is the $k$th nearest-neighbor distance from $X_i$ to some
$Y_j$, $j \in \{1, \ldots, M\}$. Then, one can prove, similarly to Corollary
3.2, that

(3.13)  $\check{H}_{N,M,k} = \frac{1}{N} \sum_{i=1}^N \log\{M \exp[-\Psi(k)]\, V_m (\rho^{(i)}_{k,M})^m\}$

is an asymptotically unbiased and consistent estimator of $\check{H}_1$
(when now both $N$ and $M$ tend to infinity) when $g$ is bounded and

(3.14)  $J_q = \int_{\mathbb{R}^m} f(x)\, g^{q-1}(x)\,dx$

exists for some $q < 1$. The difference

(3.15)  $\check{H}_{N,M,k} - \hat{H}_{N,k,1} = m \log\Big[\prod_{i=1}^N \rho^{(i)}_{k,M}\Big]^{1/N} + \log M - \Psi(k) + \log V_m - m \log\Big[\prod_{i=1}^N \rho^{(i)}_{k,N-1}\Big]^{1/N} - \log(N-1) + \Psi(k) - \log V_m$
        $\quad = m \log\Big[\prod_{i=1}^N \frac{\rho^{(i)}_{k,M}}{\rho^{(i)}_{k,N-1}}\Big]^{1/N} + \log\frac{M}{N-1}$


thus gives an asymptotically unbiased and consistent estimator of
$K(f,g)$. Obviously a similar technique can be used to estimate the
(symmetric) Kullback-Leibler divergence $K(f,g) + K(g,f)$. Note, in
particular, that when $f$ is unknown and only the sample
$X_1, \ldots, X_N$ is available while $g$ is known, then the term
$\check{H}_1$ in $K(f,g)$ can be estimated either by (3.13) with a sample
$Y_1, \ldots, Y_M$ generated from $g$, with $M$ taken arbitrarily large, or
more simply by the Monte Carlo estimator

(3.16)  $\check{H}_{1,N}(g) = -\frac{1}{N} \sum_{i=1}^N \log g(X_i),$

the term $H_1$ being still estimated by (3.9). This forms an alternative
to the method by Broniatowski [6]. Compared to the method by Jimenez and
Yukich [19] based on Voronoi tessellations (see also [27] for a
Voronoi-based method for Shannon entropy estimation), it does not require
any computation of multidimensional integrals. In some applications one
wishes to optimize $K(f,g)$ with respect to $g$ that belongs to some class
$\mathcal{G}$ (possibly parametric), with $f$ fixed. Note that only the first
term $\check{H}_1$ of (3.12) then needs to be estimated. [Maximum
likelihood estimation, with $g = g_\theta$ in a parametric class, is a most
typical example: $\theta$ is then estimated by minimizing
$\check{H}_{1,N}(g_\theta)$; see (3.16).]
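
The final form of (3.15) needs only two nearest-neighbor searches per observation: one among the $Y_j$ and one among the other $X_j$. The sketch below is a brute-force illustration with hypothetical function names, assuming samples stored as lists of tuples of equal dimension; the two unit-variance normal samples with means 0 and 1 are chosen here because their Kullback-Leibler relative entropy is exactly 1/2.

```python
import heapq
import math
import random


def kth_nn_distance(x, points, k, skip_index=None):
    """k-th smallest distance from x to points, optionally skipping one index."""
    dists = (math.dist(x, y) for j, y in enumerate(points) if j != skip_index)
    return heapq.nsmallest(k, dists)[-1]


def kl_estimate(xs, ys, k):
    """Estimate K(f, g) by (3.15) from xs ~ f and ys ~ g."""
    n, m_count = len(xs), len(ys)
    dim = len(xs[0])
    total = 0.0
    for i, x in enumerate(xs):
        rho_cross = kth_nn_distance(x, ys, k)               # rho_{k,M}^{(i)}
        rho_self = kth_nn_distance(x, xs, k, skip_index=i)  # rho_{k,N-1}^{(i)}
        total += math.log(rho_cross / rho_self)
    return dim * total / n + math.log(m_count / (n - 1))


random.seed(2)
xs = [(random.gauss(0.0, 1.0),) for _ in range(500)]
ys = [(random.gauss(1.0, 1.0),) for _ in range(500)]
kl_hat = kl_estimate(xs, ys, k=3)  # true value: 0.5
```

The digamma and unit-ball terms cancel in the difference, so neither needs to be computed.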

The Kullback-Leibler relative entropy can be used to construct a measure
of mutual information (MI) between statistical distributions (see [22])
with applications in image [29, 44] and signal processing [23]. Let $a_i$
and $b_i$ denote the gray levels of pixel $i$ in two images $A$ and $B$
respectively, $i = 1, \ldots, N$. The image matching problem consists in
finding an image $B$ in a data base that resembles a given reference
image $A$. The MI method corresponds to maximizing $K(f, f_x f_y)$, with
$f$ the joint density of the pairs $(a_i, b_i)$ and $f_x$ (resp. $f_y$) the
density of gray levels in image $A$ (resp. $B$). We have
$K(f, f_x f_y) = H_1(f_x) + H_1(f_y) - H_1(f)$, where each term can be
estimated by (3.9) from one of the three samples $(a_i)$, $(b_i)$ or
$(a_i, b_i)$ (but $A$ being fixed, only the last two terms need be
estimated).

Another example of statistical distance between distributions is given by
the following nonsymmetric Bregman distance

(3.17)  $D_q(f,g) = \int_{\mathbb{R}^m} \Big[ g^q(x) + \frac{1}{q-1} f^q(x) - \frac{q}{q-1} f(x)\, g^{q-1}(x) \Big]\,dx,$

or its symmetrized version

$K_q(f,g) = \frac{1}{q} [D_q(f,g) + D_q(g,f)] = \frac{1}{q-1} \int_{\mathbb{R}^m} [f(x) - g(x)] [f^{q-1}(x) - g^{q-1}(x)]\,dx;$

see, for example, [2]. Given $N$ independent observations from $f$ and $M$
from $g$, the first and second terms in (3.17) can be estimated by using
(3.1). In the last term, the integral $J_q$ given by (3.14) can be
estimated by
$\hat{I}_{N,M,k,q} = (1/N) \sum_{i=1}^N \{M C_k V_m (\rho^{(i)}_{k,M})^m\}^{1-q}$.
Similarly to Theorem 3.1, $\hat{I}_{N,M,k,q}$ is asymptotically unbiased,
$N, M \to \infty$, for $q < 1$ if $J_q$ exists and for any $q \in (1, k+1)$
if $g$ is bounded. We also obtain a property similar to Theorem 3.2:
$\hat{I}_{N,M,k,q}$ is a consistent estimator of $J_q$, $N, M \to \infty$, for
$q < 1$ if $J_{2q-1}$ exists and for any $q \in (1, (k+2)/2)$ if $g$ is
bounded. (Notice, however, the difference with Theorem 3.2: when $q > 1$
the cases $k = 1$ and $k \geq 2$ need not be distinguished for the
estimation of $J_q$ and the upper bound on the admissible values for $q$
is slightly larger than in Theorem 3.2.)
4. Examples.

4.1. Influence of $k$. Figure 1 (left) presents $H_q^*$ as a function of
$q$ (solid line) for the normal distribution $\mathcal{N}(0, I_3)$ in
$\mathbb{R}^3$, together with estimates $\hat{H}^*_{N,k,q}$ for
$k = 1, \ldots, 5$ obtained from a single sample of size $N = 1000$. Note
that $\hat{H}^*_{N,k,q}$ is defined only for $q < k+1$ and quickly deviates
from the theoretical value $H_q^*$ when $q > (k+1)/2$ or $q < 1$ (the
difficulties for $q$ small being due to $f$ having unbounded support). For
comparison, we also compute a plug-in estimate of $H_q^*$ obtained through
a (cross-validated) kernel density estimate of $f$. Define
$\bar{H}^*_{N,q} = \log(\bar{I}_{N,q})/(1-q)$ with
$\bar{I}_{N,q} = (1/N) \sum_{i=1}^N \bar{f}^{\,q-1}_{N,i}(X_i)$ and
$\bar{f}_{N,i}(x) = [(N-1) h^m (2\pi)^{m/2}]^{-1} \sum_{j \neq i} \exp\{-\|x - X_j\|^2/(2h^2)\}$,
an $m$-variate cross-validated kernel estimator of $f$. No special care is
taken for the choice of $h$ and we simply use the value that minimizes
the asymptotic mean integrated squared error for the estimation of $f$,
that is, $h = [4/(m+2)]^{1/(m+4)} N^{-1/(m+4)}$ with $m = 3$; see [34], page
152. The evolution of $\bar{H}^*_{N,q}$ as a function of $q$ is plotted in
dotted line on Figure 1 (left): although the situation is favorable to
kernel density estimation, $k$th nearest neighbors give a better
estimation of $H_q^*$ for $q > 1$ and $k$ large enough. Figure 1 (right)
shows $N$ times the empirical mean-squared error (MSE)
$E(\hat{H}^*_{N,k,q} - H_q^*)^2$ ($k = 1, 3, 5$) as a function of $q$ using
1000 independent repetitions. The results for $N$ times the MSE
$E(\bar{H}^*_{N,q} - H_q^*)^2$ for the plug-in estimator are also shown. The
figure indicates that the $k$th nearest-neighbor estimator with $k$
satisfying $q < (k+1)/2$ is favorable in comparison to the plug-in
estimator (for $q > 1$ values of $k$ larger than 1 are preferable, whereas
$k = 1$ is preferable for $q < 1$).

Fig. 1. Behavior of estimators of entropy for samples from the normal
distribution $\mathcal{N}(0, I_3)$ in $\mathbb{R}^3$ ($N = 1000$). [Left] $H_q^*$
(solid line), $\hat{H}^*_{N,k,q}$ (dashed lines) and $\bar{H}^*_{N,q}$ obtained
through a kernel estimation of $f$ (dotted line) as functions of $q$.
[Right] $N = 1000$ times the empirical MSE for $\hat{H}^*_{N,k,q}$ [$k = 1$
(dots), $k = 3$ (circles), $k = 5$ (squares)] and for $\bar{H}^*_{N,q}$ (plus)
as a function of $q$ and computed over 1000 independent samples.
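
The kernel plug-in estimate used for comparison in this example can be sketched as follows. This is a minimal standard-library illustration of the leave-one-out Gaussian kernel estimator with the fixed bandwidth $h = [4/(m+2)]^{1/(m+4)} N^{-1/(m+4)}$ quoted in the text (which presumes unit-variance components); the function names are hypothetical and the one-dimensional normal sample is an assumption made here to keep the run time small.

```python
import math
import random


def fixed_bandwidth(n, m):
    """h = [4 / (m + 2)]^(1/(m+4)) * N^(-1/(m+4)), for unit-variance data."""
    return (4.0 / (m + 2.0)) ** (1.0 / (m + 4.0)) * n ** (-1.0 / (m + 4.0))


def loo_kernel_density(sample, i, h):
    """Leave-one-out Gaussian kernel density estimate evaluated at sample[i]."""
    n, m = len(sample), len(sample[0])
    norm = (n - 1) * h ** m * (2.0 * math.pi) ** (m / 2.0)
    x = sample[i]
    total = sum(math.exp(-math.dist(x, y) ** 2 / (2.0 * h * h))
                for j, y in enumerate(sample) if j != i)
    return total / norm


def renyi_kernel_plugin(sample, q):
    """Plug-in Renyi entropy log(I_bar) / (1 - q), q != 1."""
    n, m = len(sample), len(sample[0])
    h = fixed_bandwidth(n, m)
    i_bar = sum(loo_kernel_density(sample, i, h) ** (q - 1.0)
                for i in range(n)) / n
    return math.log(i_bar) / (1.0 - q)


random.seed(3)
sample = [(random.gauss(0.0, 1.0),) for _ in range(400)]
h2_plugin = renyi_kernel_plugin(sample, q=2.0)
h2_exact = 0.5 * math.log(2.0 * math.pi) + 0.5 * math.log(2.0)  # (2.2), m = 1
```

With a single global bandwidth the density estimate is oversmoothed near the mode and undersmoothed in the tails, which is the limitation noted later for heavy-tailed Student samples.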

Similar results hold for the Student distribution $T(\nu, \Sigma, \mu)$ in
$\mathbb{R}^3$ with 4 degrees of freedom, $\Sigma = I_3$ and $\mu = 0$; see
Figure 2. In selecting $k$ for $\hat{H}^*_{N,k,q}$, large values of $k$ are
still generally preferable when $q > 1$.

At this stage, the optimal selection of $k$ in $\hat{I}_{N,k,q}$ depending on
$q$ and $N$ remains an open issue (see Sections 3.2 and 5). We repeated a
series of intensive simulations to see how the MSE
$E(\hat{I}_{N,k,q} - I_q)^2$ evolves when $k$ varies, for different choices
of $N$, $q$ and $m$. Figure 3 shows the influence of $N$ on the MSE for
$\hat{I}_{N,k,q}$ for different values of $q$ using 10 000 independent
repetitions, for $f$ the density of the standard normal
$\mathcal{N}(0, 1)$ and the normal $\mathcal{N}(0, I_3)$. For both $m = 1$ and
$m = 3$ changes in $N$ appear to have a greater influence on $N$ times the
MSE for $q = 1.1$ in comparison to $q = 4$. In particular, the figure
indicates that for $m = 3$ and $q = 1.1$ the MSE decreases more slowly
than $1/N$. Figure 4 shows the influence of $q$ on $N$ times the MSE for
$\hat{I}_{N,k,q}$ as $k$ varies.

Although our simulations do not reveal a precise rule for choosing k, they 
indicate that this choice is not critical for practical applications: taking k 
between 5 and 10 for q<2 and increasing from 10 to 20 for q from 2 to 4 
gives reasonably good results for the cases we considered. 

4.2. Information spectrum, estimation of $\mathrm{var}[\log f(X)]$. We use
the method suggested in Remark 3.2 and estimate
$S(f) = \mathrm{var}[\log f(X)]$ by $\hat{S}_{N,1}$ given by (3.11) from a
sample of 50 000 data generated with the Student distribution with 5
degrees of freedom. $S(f_\nu)$ is a decreasing function of $\nu$ and
$S(f_4) \approx 0.9661$, $S(f_5) \approx 0.8588$, $S(f_6) \approx 0.7911$; see
Section 2.3. The empirical mean and standard deviation of $\hat{S}_{N,1}$
obtained from 10 000 independent repetitions are 0.8578 and 0.0269
respectively, indicating that $\nu$ can be correctly estimated in this
way.

Fig. 2. Same information as in Figure 1 but for the Student distribution
$T(\nu, \Sigma, \mu)$ in $\mathbb{R}^3$ with 4 degrees of freedom
($\Sigma = I_3$, $\mu = 0$, $N = 1000$).

Fig. 3. $N$ times the empirical MSE for $\hat{I}_{N,k,q}$ as a function of
$k$ (10 000 independent repetitions), for $f$ the density of the standard
normal $\mathcal{N}(0, 1)$ and $\mathcal{N}(0, I_3)$ in $\mathbb{R}^3$ for varying
$N$ {$N = 1000$ (dots), 2000 (stars), 5000 (circles) and 10 000
(squares)} and $q = 1.1$ and $q = 4$. [Panels: $q = 1.1$,
$\mathcal{N}(0,1)$; $q = 4$, $\mathcal{N}(0,1)$; $q = 1.1$,
$\mathcal{N}(0,I_3)$; $q = 4$, $\mathcal{N}(0,I_3)$.]

4.3. Estimation of Kullback-Leibler divergence. We use the same Student
data as in 4.2 and estimate the Kullback-Leibler relative entropy
$K(f, f_\nu)$ given by (3.12), using (3.16) for the estimation of
$\check{H}_1$ and (3.9) for the estimation of $H_1$, the entropy of $f$. The
empirical means of the divergences estimated for $\nu = 1, \ldots, 8$ in
10 000 independent repetitions are 0.1657, 0.0440, 0.0119, 0.0021,
0.0000, 0.0012, 0.0038 and 0.0069 [the empirical standard deviations are
rather large, approximately 0.0067 for each $\nu$, but the minimum is at
$\nu = 5$ in all the 10 000 cases; notice that the dependence in $\nu$ is
only through the term (3.16) where $f_\nu$ is substituted for $g$]. Again,
$\nu$ is correctly estimated in this way.

Fig. 4. $N$ times the empirical MSE for $\hat{I}_{N,k,q}$ as a function of
$k$ (10 000 independent repetitions), for $f$ the density of the standard
normal $\mathcal{N}(0, 1)$ and $\mathcal{N}(0, I_3)$ in $\mathbb{R}^3$ for varying
$q$ {$q = 0.75$ (dots), $q = 0.95$ (circles), $q = 1.1$ (squares) and
$q = 2$ (stars)} and $N = 1000$.

Fig. 5. Empirical means of $\hat{H}^*_{N,3,0.75}$ (solid line) and
$\hat{H}_{N,3,1}$ (dashed line) and two standard deviations (vertical bars)
in a mixture of Student and normal distributions as functions of the
mixture coefficient $\beta$ for $N = 500$ (1000 independent repetitions).

4.4. q-entropy maximizing distributions. We generate N = 500 i.i.d. samples from the mixture of the three-dimensional Student distribution $T(\nu, (\nu-2)/\nu\, I_3, 0)$ with $\nu = 5$ and the normal distribution $\mathcal{N}(0,I_3)$, with relative weights $\beta$ and $1-\beta$. The covariance matrix of both distributions is the identity $I_3$, the Student distribution is q-entropy maximizing for $q = 1 - 2/(\nu+m) = 0.75$ (see Section 2.2) and the normal distribution maximizes Shannon entropy (q = 1). Figure 5 presents a plot of $\hat H^*_{N,k,q}$ for q = 0.75 and $\hat H_{N,k,1}$ as functions of the mixture coefficient $\beta$; both use k = 3 and are averaged over 1 000 repetitions; the vertical bars indicate two empirical standard deviations. [The values of $H^*_{0.75}$ estimated by plug-in using the kernel estimator $f_{N,i}(x)$ of Example 1 are totally out of the range for Student distributed variables due to the use of a nonadaptive bandwidth.]

5. Related results and further developments. The paper by Jimenez and 
Yukich [19] gives a method for estimating statistical distances between distri- 
butions with densities / and g based on Voronoi tessellations. Given an i.i.d. 
sample from /, it relies on the comparison between the Lebesgue measure 
(volume) and the measure for g of the Voronoi cells (polyhedra) constructed 
from the sample. Voronoi tessellations are also used in [27] to estimate the Shannon entropy of f based on an i.i.d. sample. The method requires the computation of the volumes of the Voronoi cells and no asymptotic result is given. Comparatively, the method based on nearest neighbors does not require any computation of (multidimensional) integrals. A possible motivation for using Voronoi tessellations could be the natural adaptation to the shape of the distribution. One may then notice that the metric used to compute nearest-neighbor distances can be adapted to the observed sample: for $X_1,\ldots,X_N$, a sample having a nonspherical distribution, its empirical covariance matrix $\hat\Sigma_N$ can be used to define a new metric through $\|x\|^2_{\hat\Sigma_N} = x^\top\hat\Sigma_N^{-1}x$, the volume $V_m$ of the unit ball in this metric becoming $|\hat\Sigma_N|^{1/2}\pi^{m/2}/\Gamma(m/2+1)$.
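As a small illustration of the adapted metric, the sketch below treats the two-dimensional case only, where the inverse and determinant of the covariance matrix have closed forms. The helper names are hypothetical; the covariance is passed as the triple (s_xx, s_xy, s_yy).

```python
import math


def empirical_covariance_2d(points):
    """Unbiased empirical covariance of 2-D points, returned as (s_xx, s_xy, s_yy)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy


def adapted_norm(x, cov):
    """sqrt(x^T S^{-1} x) for the symmetric 2x2 matrix S = [[s_xx, s_xy], [s_xy, s_yy]]."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    quad = (syy * x[0] ** 2 - 2 * sxy * x[0] * x[1] + sxx * x[1] ** 2) / det
    return math.sqrt(quad)


def adapted_unit_ball_volume(cov):
    """|S|^(1/2) pi^(m/2) / Gamma(m/2 + 1) with m = 2, i.e. pi * sqrt(det S)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    return math.sqrt(det) * math.pi / math.gamma(2.0)


if __name__ == "__main__":
    cov = (4.0, 0.0, 1.0)
    print(adapted_norm((2.0, 0.0), cov), adapted_unit_ball_volume(cov))
```

Nearest-neighbor distances would then be computed with `adapted_norm` applied to differences of sample points, and `adapted_unit_ball_volume` would replace $V_m$ in the estimator.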

$\sqrt{N}$-consistency of an estimator of $H_1$ based on nearest-neighbor distances (k = 1) is proved by Tsybakov and van der Meulen [39] for m = 1 and sufficiently regular densities f with unbounded support using a truncation argument. On the other hand, $\sqrt{N}$-consistency of the estimator $\hat I_{N,k,q}$ is still an open issue (notice that the bias approximations of Section 3.1 indicate that it does not hold for large m). As for the case of spacing methods, where the spacing can be taken as an increasing function of the sample size N (see, e.g., [12, 40, 41]), it might be of interest to let $k = k_N$ increase with N; see also [35] and Section 3.2. Properties of nearest-neighbor distances with $k_N \to \infty$ are considered, for instance, by Devroye and Wagner [7], Liero [24], Loftsgaarden and Quesenberry [26] and Moore and Yackel [28]. The derivation of an estimate of the asymptotic mean-squared error of the estimator could be used in a standard way to construct a rule for choosing k as a function of q, m and N (see Sections 3.1 and 3.2). Numerical simulations indicate, however, that this choice is not as critical as that of the bandwidth in a kernel density estimator used for plug-in entropy estimation; see Section 4.

A central limit theorem for functions $h(\rho)$ of nearest-neighbor distances is obtained by Bickel and Breiman [4] for k = 1 and by Penrose [30] for $k = k_N \to \infty$ as $N \to \infty$. However, their results do not apply to unbounded functions of $\rho$, such as $h(\rho) = \rho^{m(1-q)}$ [see (3.1)], or $h(\rho) = \log(\rho)$ [see (3.9)]. Conditions for the asymptotic normality of $\hat I_{N,k,q}$ are under current investigation.

6. Proofs. The following lemma summarizes some properties of $I_q$.

Lemma 1.

(i) If f is bounded, then $I_q < \infty$ for any q > 1.

(ii) If $I_q < \infty$ for some q < 1, then $I_{q'} < \infty$ for any $q' \in (q,1)$.

(iii) If f is of finite support, $I_q < \infty$ for any $q \in [0,1)$.

Proof.

(i) If $f(x) \le \bar f$ and q > 1, $I_q = \int_{f\le 1} f^q + \int_{f>1} f^q \le \int_{f\le 1} f + \bar f^{\,q-1}\int_{f>1} f < \infty$.

(ii) If $q < q' < 1$, $I_{q'} = \int_{f\le 1} f^{q'} + \int_{f>1} f^{q'} \le \int_{f\le 1} f^{q} + \int_{f>1} f < \infty$ if $I_q < \infty$.

(iii) If $\mu_S = \mu_L\{x : f(x) > 0\} < \infty$ and $0 \le q < 1$, $I_q = \int_{f\le 1} f^q + \int_{f>1} f^q \le \mu_S + \int_{f>1} f < \infty$. □

The proofs of Theorems 3.1 and 3.2 use the following lemmas. 

Lemma 2 [Lebesgue (1910)]. If $g \in L_1(\mathbb{R}^m)$, then for any sequence of open balls $B(x,R_k)$ of radius tending to zero as $k \to \infty$ and for $\mu_L$-almost any $x \in \mathbb{R}^m$,

$$\lim_{k\to\infty} \frac{1}{V_m R_k^m} \int_{B(x,R_k)} g(t)\,dt = g(x).$$



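Lemma 3. For any $\beta > 0$,

$$\int_0^\infty x^{\beta}\,F(dx) = \beta\int_0^\infty x^{\beta-1}[1-F(x)]\,dx \tag{6.1}$$

and

$$\int_0^\infty x^{-\beta}\,F(dx) = \beta\int_0^\infty x^{-\beta-1}F(x)\,dx, \tag{6.2}$$

in the sense that if one side converges so does the other.

Proof. See [9], volume 2, page 150, for (6.1). The proof is similar for (6.2). Define $\alpha = -\beta < 0$ and $I_{a,b} = \int_a^b x^{\alpha}F(dx)$ for some a, b, with $0 < a < b < \infty$. Integration by parts gives $I_{a,b} = [b^{\alpha}F(b) - a^{\alpha}F(a)] - \alpha\int_a^b x^{\alpha-1}F(x)\,dx$ and, since $\alpha < 0$, $\lim_{b\to\infty} I_{a,b} = I_{a,\infty} = -a^{\alpha}F(a) - \alpha\int_a^\infty x^{\alpha-1}F(x)\,dx < \infty$. Suppose that $\int_0^\infty x^{-\beta}F(dx) = J < \infty$. It implies $\lim_{a\to 0^+} I_{0,a} = 0$ and, since $I_{0,a} \ge a^{\alpha}F(a)$, $\lim_{a\to 0^+} a^{\alpha}F(a) = 0$. Therefore, $\lim_{a\to 0^+} -\alpha\int_a^\infty x^{\alpha-1}F(x)\,dx = J$.

Conversely, suppose that $\lim_{a\to 0^+} -\alpha\int_a^\infty x^{\alpha-1}F(x)\,dx = J < \infty$. Since $I_{a,\infty} \le -\alpha\int_a^\infty x^{\alpha-1}F(x)\,dx$, $\lim_{a\to 0^+} I_{a,\infty} = J$. □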

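6.1. Proof of Theorem 3.1. Since the $X_i$'s are i.i.d.,

$$\mathrm{E}\hat I_{N,k,q} = \mathrm{E}\zeta_{N,i,k}^{1-q} = \mathrm{E}[\mathrm{E}(\zeta_{N,i,k}^{1-q}\mid X_i = x)],$$

where the random variable $\zeta_{N,i,k}$ is defined by (3.2). Its distribution function conditional on $X_i = x$ is given by

$$F_{N,x,k}(u) = \Pr(\zeta_{N,i,k} < u \mid X_i = x) = \Pr[\rho^{(i)}_{k,N-1} < R_N(u) \mid X_i = x],$$

where

$$R_N(u) = \{u/[(N-1)V_mC_k]\}^{1/m}. \tag{6.3}$$

Let $B(x,r)$ be the open ball of center x and radius r. We have

$$F_{N,x,k}(u) = \Pr\{k \text{ elements or more} \in B[x,R_N(u)]\} = \sum_{j=k}^{N-1}\binom{N-1}{j}p_{N,u}^{j}(1-p_{N,u})^{N-1-j} = 1 - \sum_{j=0}^{k-1}\binom{N-1}{j}p_{N,u}^{j}(1-p_{N,u})^{N-1-j},$$

where $p_{N,u} = \int_{B[x,R_N(u)]} f(t)\,dt$. From the Poisson approximation of the binomial distribution, Lemma 2 gives

$$F_{N,x,k}(u) \to F_{x,k}(u) = 1 - \exp(-\lambda u)\sum_{j=0}^{k-1}\frac{(\lambda u)^j}{j!}$$

when $N \to \infty$ for $\mu_L$-almost any x, with $\lambda = f(x)/C_k$, that is, $F_{N,x,k}$ tends to the Erlang distribution $F_{x,k}$, with p.d.f. $f_{x,k}(u) = [\lambda^k u^{k-1}\exp(-\lambda u)]/\Gamma(k)$. Direct calculation gives

$$\int_0^\infty u^{1-q}f_{x,k}(u)\,du = \frac{\Gamma(k+1-q)}{\lambda^{1-q}\Gamma(k)} = f^{q-1}(x)$$

for any q < k + 1.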
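The two facts used in this step, the convergence of the binomial tail to the Erlang distribution function and the value of the Erlang moment, can be checked numerically. The sketch below is an illustration only: it replaces $p_{N,u}$ by its limiting approximation $\lambda u/(N-1)$, and it assumes $C_k = [\Gamma(k)/\Gamma(k+1-q)]^{1/(1-q)}$ as in (3.2); the function names are hypothetical.

```python
import math


def binomial_tail(n_other, p, k):
    """Pr{at least k of n_other independent points fall in a ball of probability p}."""
    below = sum(
        math.comb(n_other, j) * p ** j * (1.0 - p) ** (n_other - j) for j in range(k)
    )
    return 1.0 - below


def erlang_cdf(u, lam, k):
    """F_{x,k}(u) = 1 - exp(-lam u) * sum_{j<k} (lam u)^j / j!."""
    s = sum((lam * u) ** j / math.factorial(j) for j in range(k))
    return 1.0 - math.exp(-lam * u) * s


def erlang_moment(lam, k, q):
    """int_0^inf u^(1-q) f_{x,k}(u) du = Gamma(k+1-q) / (lam^(1-q) Gamma(k)), q < k + 1."""
    return math.exp(math.lgamma(k + 1 - q) - math.lgamma(k)) / lam ** (1 - q)


def c_k(k, q):
    """Assumed normalizing constant C_k = [Gamma(k) / Gamma(k + 1 - q)]^(1 / (1 - q))."""
    return math.exp((math.lgamma(k) - math.lgamma(k + 1 - q)) / (1 - q))


if __name__ == "__main__":
    n_other, lam, u, k = 100000, 1.0, 2.0, 3
    print(binomial_tail(n_other, lam * u / n_other, k), erlang_cdf(u, lam, k))
    f_x, q = 0.3, 1.5
    print(erlang_moment(f_x / c_k(k, q), k, q), f_x ** (q - 1))
```

The second printed pair agrees up to rounding because $\lambda^{1-q} = f^{1-q}(x)\,\Gamma(k+1-q)/\Gamma(k)$ when $\lambda = f(x)/C_k$.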

Suppose first that q < 1 and consider the random variables $(U,X)$ with joint p.d.f. $f_{N,x,k}(u)f(x)$ on $\mathbb{R}\times\mathbb{R}^m$, where $f_{N,x,k}(u) = dF_{N,x,k}(u)/du$. The function $u \to u^{1-q}$ is bounded on every bounded interval and the generalized Helly-Bray Lemma (see [25], page 187) implies

$$\mathrm{E}\hat I_{N,k,q} = \int_{\mathbb{R}^m}\int_0^\infty u^{1-q}f_{N,x,k}(u)f(x)\,du\,dx \to \int_{\mathbb{R}^m} f^{q-1}(x)f(x)\,dx = I_q, \qquad N\to\infty,$$

which completes the proof.

Suppose now that 1 < q < k + 1. Note that from Lemma 1(i) $I_q < \infty$. Consider

$$J_N = \int_0^\infty u^{(1-q)(1+\delta)}F_{N,x,k}(du).$$

We show that $\sup_N J_N < \infty$ for some $\delta > 0$. From Theorem 2.5.1 of Bierens [5], page 34, it implies

$$Z_{N,k}(x) = \int_0^\infty u^{1-q}F_{N,x,k}(du) \to Z_k(x) = \int_0^\infty u^{1-q}F_{x,k}(du) = f^{q-1}(x), \qquad N\to\infty, \tag{6.4}$$

for $\mu_L$-almost any x in $\mathbb{R}^m$.

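Define $\beta = (1-q)(1+\delta)$, so that $\beta < 0$, and take $\delta < (k+1-q)/(q-1)$ so that $\beta + k > 0$. From (6.2),

$$\begin{aligned}
J_N &= -\beta\int_0^\infty u^{\beta-1}F_{N,x,k}(u)\,du \\
&= -\beta\int_0^1 u^{\beta-1}F_{N,x,k}(u)\,du - \beta\int_1^\infty u^{\beta-1}F_{N,x,k}(u)\,du \\
&\le -\beta\int_0^1 u^{\beta-1}F_{N,x,k}(u)\,du - \beta\int_1^\infty u^{\beta-1}\,du \\
&= 1 - \beta\int_0^1 u^{\beta-1}F_{N,x,k}(u)\,du.
\end{aligned} \tag{6.5}$$

Since $f(x)$ is bounded, say, by $\bar f$, we have $\forall x \in \mathbb{R}^m$, $\forall u \in \mathbb{R}$, $\forall N$, $p_{N,u} \le \bar f V_m[R_N(u)]^m = \bar f u/[(N-1)C_k]$. It implies

$$\begin{aligned}
\frac{F_{N,x,k}(u)}{u^k} &\le \sum_{j=k}^{N-1}\binom{N-1}{j}\frac{\bar f^{\,j}u^{j-k}}{C_k^{j}(N-1)^{j}} \le \sum_{j=k}^{N-1}\frac{\bar f^{\,j}u^{j-k}}{C_k^{j}\,j!} \\
&= \frac{\bar f^{\,k}}{C_k^{k}\,k!} + \sum_{j=k+1}^{N-1}\frac{\bar f^{\,j}u^{j-k}}{C_k^{j}\,j!} \le \frac{\bar f^{\,k}}{C_k^{k}\,k!} + \frac{\bar f^{\,k}}{C_k^{k}}\sum_{j=1}^{N-k-1}\frac{\bar f^{\,j}u^{j}}{C_k^{j}\,j!}
\end{aligned}$$

and thus, for u < 1,

$$\frac{F_{N,x,k}(u)}{u^k} \le U_k = \frac{\bar f^{\,k}}{C_k^{k}\,k!} + \frac{\bar f^{\,k}}{C_k^{k}}\left\{\exp\left[\frac{\bar f}{C_k}\right] - 1\right\}. \tag{6.6}$$

Therefore, from (6.5),

$$J_N \le 1 - \beta U_k\int_0^1 u^{k+\beta-1}\,du = 1 - \frac{\beta U_k}{k+\beta} < \infty, \tag{6.7}$$

which implies (6.4). Now we only need to prove that

$$\int_{\mathbb{R}^m} Z_{N,k}(x)f(x)\,dx \to \int_{\mathbb{R}^m} Z_k(x)f(x)\,dx = I_q, \qquad N\to\infty.$$

But this follows from Lebesgue's bounded convergence theorem, since $Z_{N,k}(x)$ is bounded (take $\delta = 0$ in $J_N$).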
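6.2. Proof of Theorem 3.2. We shall use the same notations as in the proof of Theorem 3.1 and write $\hat I_{N,k,q} = (1/N)\sum_{i=1}^N \zeta_{N,i,k}^{1-q}$, so that

$$\mathrm{E}(\hat I_{N,k,q} - I_q)^2 = \frac{\mathrm{E}(\zeta_{N,i,k}^{1-q} - I_q)^2}{N} + \frac{1}{N^2}\sum_{i\ne j}\mathrm{E}\{(\zeta_{N,i,k}^{1-q} - I_q)(\zeta_{N,j,k}^{1-q} - I_q)\}. \tag{6.8}$$

We consider the cases q < 1 and q > 1 separately.

q < 1. Note that 2q − 1 < q < 1 and Lemma 1(ii) gives $I_q < \infty$ when $I_{2q-1} < \infty$. Consider the first term on the right-hand side of (6.8). We have

$$\mathrm{E}(\zeta_{N,i,k}^{1-q} - I_q)^2 = \mathrm{E}(\zeta_{N,i,k}^{2(1-q)}) + I_q^2 - 2I_q\,\mathrm{E}\zeta_{N,i,k}^{1-q}, \tag{6.9}$$

where the last term tends to $-2I_q^2$ from Theorem 3.1. Consider the first term,

$$\mathrm{E}\zeta_{N,i,k}^{2(1-q)} = \int_{\mathbb{R}^m}\int_0^\infty u^{2(1-q)}f_{N,x,k}(u)f(x)\,du\,dx.$$

Since the function $u \to u^{2(1-q)}$ is bounded on every bounded interval, it tends to

$$\int_{\mathbb{R}^m}\int_0^\infty u^{2(1-q)}f_{x,k}(u)f(x)\,du\,dx = I_{2q-1}\frac{\Gamma(k+2-2q)\Gamma(k)}{\Gamma^2(k+1-q)}$$

for any q < (k + 2)/2 (generalized Helly-Bray lemma, Loève [25], page 187). Therefore, $\mathrm{E}(\zeta_{N,i,k}^{1-q} - I_q)^2$ tends to a finite limit and the first term on the right-hand side of (6.8) tends to zero as $N \to \infty$.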

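Consider now the second term of (6.8). We show that

$$\mathrm{E}\{(\zeta_{N,i,k}^{1-q} - I_q)(\zeta_{N,j,k}^{1-q} - I_q)\} \to 0, \qquad N\to\infty.$$

Since $\mathrm{E}\zeta_{N,i,k}^{1-q} \to I_q$ from Theorem 3.1, we only need to show that $\mathrm{E}\{\zeta_{N,i,k}^{1-q}\zeta_{N,j,k}^{1-q}\} \to I_q^2$. Define

$$\begin{aligned}
F_{N,x,y,k}(u,v) &= \Pr\{\zeta_{N,i,k} < u,\ \zeta_{N,j,k} < v \mid X_i = x, X_j = y\} \\
&= \Pr\{\rho^{(i)}_{k,N-1} < R_N(u),\ \rho^{(j)}_{k,N-1} < R_N(v) \mid X_i = x, X_j = y\},
\end{aligned}$$

so that

$$\mathrm{E}\{\zeta_{N,i,k}^{1-q}\zeta_{N,j,k}^{1-q}\} = \int_{\mathbb{R}^m}\int_{\mathbb{R}^m}\int_0^\infty\int_0^\infty u^{1-q}v^{1-q}F_{N,x,y,k}(du,dv)f(x)f(y)\,dx\,dy. \tag{6.10}$$

Let us assume that $x \ne y$. From the definition of $R_N(u)$ [see (6.3)] there exists $N_0 = N_0(x,y,u,v)$ such that $B[x,R_N(u)] \cap B[y,R_N(v)] = \emptyset$ for $N > N_0$ and thus,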

FN,x,y,kiu,v) = L L j j I ) 

with $p_{N,u} = \int_{B[x,R_N(u)]} f(t)\,dt$, $p_{N,v} = \int_{B[y,R_N(v)]} f(t)\,dt$. Hence, for $N > N_0$,

j=0 1=0 ^ ^ ^ ' 

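Similarly to the proof of Theorem 3.1, we then obtain

$$F_{N,x,y,k}(u,v) \to F_{x,y,k}(u,v) = F_{x,k}(u)F_{y,k}(v), \qquad N\to\infty, \tag{6.11}$$

for $\mu_L$-almost any x and y with

$$\int_0^\infty\int_0^\infty u^{1-q}v^{1-q}F_{x,y,k}(du,dv) = f^{q-1}(x)f^{q-1}(y), \tag{6.12}$$

for any q < k + 1. Since the function $u \to u^{1-q}$ is bounded on every bounded interval, (6.10) gives

$$\mathrm{E}\{\zeta_{N,i,k}^{1-q}\zeta_{N,j,k}^{1-q}\} \to \int_{\mathbb{R}^m}\int_{\mathbb{R}^m} f^{q}(x)f^{q}(y)\,dx\,dy = I_q^2, \qquad N\to\infty$$

(generalized Helly-Bray lemma, [25], page 187). This completes the proof that $\mathrm{E}(\hat I_{N,k,q} - I_q)^2 \to 0$. Therefore, $\hat I_{N,k,q} \to I_q$ in probability when $N \to \infty$.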

q > 1. Note that from Lemma 1(i) $I_q$ and $I_{2q-1}$ both exist. Consider the first term on the right-hand side of (6.8). We have again (6.9) where the last term tends to $-2I_q^2$ (the assumptions of the theorem imply q < k + 1 so that Theorem 3.1 applies). Consider the first term of (6.9). Define

$$J'_N = \int_0^\infty u^{2(1-q)(1+\delta)}F_{N,x,k}(du);$$

we show that $\sup_N J'_N < \infty$ for some $\delta > 0$. From the assumptions of the theorem, 2q < k + 2. Let $\beta = 2(1-q)(1+\delta)$, so that $\beta < 0$, and take $\delta < (k+2-2q)/[2(q-1)]$ so that $\beta + k > 0$. Using Lemma 3 and developments similar to the proof of Theorem 3.1, we obtain

$$J'_N = -\beta\int_0^\infty u^{\beta-1}F_{N,x,k}(u)\,du \le 1 - \beta\int_0^1 u^{\beta-1}F_{N,x,k}(u)\,du \le 1 - \beta U_k\int_0^1 u^{k+\beta-1}\,du = 1 - \frac{\beta U_k}{k+\beta} < \infty,$$

where $U_k$ is given by (6.6). Theorem 2.5.1 of Bierens [5] then implies

$$\int_0^\infty u^{2(1-q)}F_{N,x,k}(du) \to \int_0^\infty u^{2(1-q)}F_{x,k}(du) = \frac{\Gamma(k+2-2q)\Gamma(k)}{\Gamma^2(k+1-q)}f^{2q-2}(x), \qquad N\to\infty,$$

for $\mu_L$-almost any x, q < (k + 2)/2, and Lebesgue's bounded convergence theorem gives $\mathrm{E}\zeta_{N,i,k}^{2(1-q)} \to I_{2q-1}\Gamma(k+2-2q)\Gamma(k)/\Gamma^2(k+1-q)$, $N \to \infty$. The first term of (6.8) thus tends to zero.

Consider now the second term. As in the case q < 1, we only need to show that $\mathrm{E}\{\zeta_{N,i,k}^{1-q}\zeta_{N,j,k}^{1-q}\} \to I_q^2$. Define

$$J''_N = \int_0^\infty\int_0^\infty u^{(1-q)(1+\delta)}v^{(1-q)(1+\delta)}F_{N,x,y,k}(du,dv).$$

Using (6.11, 6.12), proving that $\sup_N J''_N \le J(x,y) < \infty$ for some $\delta > 0$ will then establish that

$$\int_0^\infty\int_0^\infty u^{1-q}v^{1-q}F_{N,x,y,k}(du,dv) \to f^{q-1}(x)f^{q-1}(y), \qquad N\to\infty, \tag{6.13}$$

for $\mu_L$-almost x and y; see Theorem 2.5.1 of Bierens [5]. Using (6.10), if

$$\int_{\mathbb{R}^m}\int_{\mathbb{R}^m} J(x,y)f(x)f(y)\,dx\,dy < \infty, \tag{6.14}$$

Lebesgue's dominated convergence theorem will then complete the proof. Integration by parts, as in the proof of Lemma 3, gives

$$J''_N = \beta^2\int_0^\infty\int_0^\infty u^{\beta-1}v^{\beta-1}F_{N,x,y,k}(u,v)\,du\,dv,$$

where $\beta = (1-q)(1+\delta) < 0$. We use different bounds for $F_{N,x,y,k}(u,v)$ on three different parts of the $(u,v)$ plane.

(i) Suppose that $\max[R_N(u),R_N(v)] \le \|x-y\|$, which is equivalent to $(u,v) \in \mathcal{D}_1 = [0,\Delta]\times[0,\Delta]$ with $\Delta = \Delta(k,N,x,y) = (N-1)V_mC_k\|x-y\|^m$. This means that the balls $B[x,R_N(u)]$ and $B[y,R_N(v)]$ either do not intersect, or, when they do, their intersection contains neither x nor y. In that case, we use

$$F_{N,x,y,k}(u,v) \le \min[F_{N-1,x,k}(u),F_{N-1,y,k}(v)] \le F^{1/2}_{N-1,x,k}(u)F^{1/2}_{N-1,y,k}(v)$$

and

$$J''^{(1)}_N = \beta^2\int\!\!\int_{\mathcal{D}_1} u^{\beta-1}v^{\beta-1}F_{N,x,y,k}(u,v)\,du\,dv \le \beta^2\left[\frac{2U_k^{1/2}}{2\beta+k} - \frac{1}{\beta}\right]^2 < \infty,$$

where we used the bound (6.6) for $F_{N-1,x,k}(u)$ when u < 1, $F_{N-1,x,k}(u) \le 1$ for $u \ge 1$ and choose $\delta < (k+2-2q)/[2(q-1)]$ so that $2\beta + k > 0$ [this choice of $\delta$ is legitimate since q < (k + 2)/2].

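(ii) Suppose, without any loss of generality, that u < v and consider the domain defined by $R_N(u) \le \|x-y\| < R_N(v)$, that is, $(u,v) \in \mathcal{D}_2 = [0,\Delta]\times(\Delta,\infty)$. The cases k = 1 and $k \ge 2$ must be treated separately since $B[y,R_N(v)]$ contains x.

When k = 1, $F_{N,x,y,1}(u,v) = F_{N-1,x,1}(u)$ and we have

$$\begin{aligned}
J''^{(2)}_N &= \beta^2\int\!\!\int_{\mathcal{D}_2} u^{\beta-1}v^{\beta-1}F_{N,x,y,1}(u,v)\,du\,dv \\
&= \beta^2\int_0^\Delta u^{\beta-1}F_{N-1,x,1}(u)\,du\int_\Delta^\infty v^{\beta-1}\,dv \\
&\le \beta^2\left[U_1\int_0^1 u^{\beta}\,du + \int_1^\infty u^{\beta-1}\,du\right]\int_\Delta^\infty v^{\beta-1}\,dv \\
&= -\beta\,\Delta^{\beta}\left[\frac{U_1}{\beta+1} - \frac{1}{\beta}\right] \\
&\le J^{(2)}(x,y) = -\beta\left[\frac{U_1}{\beta+1} - \frac{1}{\beta}\right]V_m^{\beta}C_1^{\beta}\|x-y\|^{m\beta},
\end{aligned} \tag{6.15}$$

where we used (6.6) and take $\delta < (2-q)/(q-1)$ so that $\beta > -1$ (this choice of $\delta$ is legitimate since q < 2).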

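Suppose now that $k \ge 2$. We have $F_{N,x,y,k}(u,v) \le F^{1-\alpha}_{N-1,x,k}(u)F^{\alpha}_{N-1,y,k-1}(v)$, $\forall\alpha \in (0,1)$. Developments similar to those used for the derivation of (6.6) give for v < 1

$$\frac{F_{N-1,y,k-1}(v)}{v^{k-1}} \le V_k = \frac{\bar f^{\,k-1}}{C_k^{k-1}(k-1)!} + \frac{\bar f^{\,k-1}}{C_k^{k-1}}\left\{\exp\left[\frac{\bar f}{C_k}\right] - 1\right\}. \tag{6.16}$$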
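We obtain

$$\begin{aligned}
J''^{(2)}_N &= \beta^2\int\!\!\int_{\mathcal{D}_2} u^{\beta-1}v^{\beta-1}F_{N,x,y,k}(u,v)\,du\,dv \\
&\le \beta^2\int_0^\infty u^{\beta-1}F^{1-\alpha}_{N-1,x,k}(u)\,du\int_0^\infty v^{\beta-1}F^{\alpha}_{N-1,y,k-1}(v)\,dv \\
&\le \beta^2\left[\frac{U_k^{1-\alpha}}{k(1-\alpha)+\beta} - \frac{1}{\beta}\right]\left[\frac{V_k^{\alpha}}{(k-1)\alpha+\beta} - \frac{1}{\beta}\right] < \infty,
\end{aligned}$$

where we used (6.6, 6.16) and require $\beta + k(1-\alpha) > 0$ and $\beta + (k-1)\alpha > 0$. For that we take $\alpha = \alpha_k = k/(2k-1)$. Indeed, from the assumptions of the theorem, $q < (k+1)/2 < (k^2+k-1)/(2k-1)$ so that we can choose $\delta < [(k^2+k-1) - q(2k-1)]/[(q-1)(2k-1)]$, which ensures that both $\beta + k(1-\alpha_k) > 0$ and $\beta + (k-1)\alpha_k > 0$.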
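The choice $\alpha_k = k/(2k-1)$ can be checked directly. The short computation below only restates the two positivity requirements of case (ii) with $\beta = (1-q)(1+\delta)$ and $q > 1$; it is added for convenience and is not part of the original derivation.

```latex
\begin{align*}
k(1-\alpha_k) &= k\,\frac{k-1}{2k-1} = (k-1)\,\frac{k}{2k-1} = (k-1)\alpha_k, \\
\beta + \frac{k(k-1)}{2k-1} > 0
  &\iff (q-1)(1+\delta) < \frac{k(k-1)}{2k-1}
   \iff \delta < \frac{(k^2+k-1) - q(2k-1)}{(q-1)(2k-1)}, \\
\frac{k+1}{2} < \frac{k^2+k-1}{2k-1}
  &\iff 2k^2 + k - 1 < 2k^2 + 2k - 2 \iff k > 1 .
\end{align*}
```

The two exponents therefore coincide for this $\alpha_k$, and the admissible range for $\delta$ is nonempty whenever $k \ge 2$ and $q < (k+1)/2$.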

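(iii) Suppose finally that $\|x-y\| < \min[R_N(u),R_N(v)]$, that is, $(u,v) \in \mathcal{D}_3 = (\Delta,\infty)\times(\Delta,\infty)$. In that case, each of the balls $B[x,R_N(u)]$ and $B[y,R_N(v)]$ contains both x and y. Again, the cases k = 1 and $k \ge 2$ must be distinguished.

When k = 1, $F_{N,x,y,1}(u,v) = 1$ and

$$J''^{(3)}_N = \beta^2\int\!\!\int_{\mathcal{D}_3} u^{\beta-1}v^{\beta-1}F_{N,x,y,1}(u,v)\,du\,dv = \beta^2\left[\int_\Delta^\infty u^{\beta-1}\,du\right]^2 = \Delta^{2\beta} \le J^{(3)}(x,y) = V_m^{2\beta}C_1^{2\beta}\|x-y\|^{2m\beta}. \tag{6.17}$$

When $k \ge 2$, $F_{N,x,y,k}(u,v) \le F^{1/2}_{N-1,x,k-1}(u)F^{1/2}_{N-1,y,k-1}(v)$ and

$$\begin{aligned}
J''^{(3)}_N &= \beta^2\int\!\!\int_{\mathcal{D}_3} u^{\beta-1}v^{\beta-1}F_{N,x,y,k}(u,v)\,du\,dv \\
&\le \beta^2\int_0^\infty u^{\beta-1}F^{1/2}_{N-1,x,k-1}(u)\,du\int_0^\infty v^{\beta-1}F^{1/2}_{N-1,y,k-1}(v)\,dv \\
&\le \beta^2\left[\frac{2V_k^{1/2}}{2\beta+k-1} - \frac{1}{\beta}\right]^2 < \infty,
\end{aligned}$$

where we used (6.16) and take $\delta < [(k+1)-2q]/[2(q-1)]$ so that $k - 1 + 2\beta > 0$ [this choice of $\delta$ is legitimate since q < (k + 1)/2].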

Summarizing the three cases above, we obtain $J''_N = J''^{(1)}_N + 2J''^{(2)}_N + J''^{(3)}_N$ with different bounds for $J''^{(2)}_N$ and $J''^{(3)}_N$ depending on whether k = 1 or $k \ge 2$. This proves (6.13).

When $k \ge 2$, the bound on $J''_N$ does not depend on x, y and Lebesgue's bounded convergence theorem implies $\mathrm{E}\{\zeta_{N,i,k}^{1-q}\zeta_{N,j,k}^{1-q}\} \to I_q^2$, which completes the proof of the theorem; see (6.14).

When k = 1, the condition (6.14) is satisfied if $2\beta > -1$ [see (6.15), (6.17)], which is ensured by the choice $\delta < (3-2q)/[2(q-1)]$ (legitimate since q < 3/2). Indeed, we can write

$$\int_{\mathbb{R}^m}\int_{\mathbb{R}^m}\|x-y\|^{\gamma}f(x)f(y)\,dx\,dy = \int_{\mathbb{R}^m}\|x\|^{\gamma}g(x)\,dx,$$

where $g(x) = \int_{\mathbb{R}^m} f(x+y)f(y)\,dy$, and thus (since $\gamma < 0$),

$$\int_{\mathbb{R}^m}\int_{\mathbb{R}^m}\|x-y\|^{\gamma}f(x)f(y)\,dx\,dy \le \bar f\int_{\|x\|<1}\|x\|^{\gamma}\,dx + 1 = \frac{\bar f\,mV_m}{\gamma+m} + 1 < \infty,$$

when $\gamma > -m$. When $\delta < (3-2q)/[2(q-1)]$, Lebesgue's dominated convergence theorem thus implies $\mathrm{E}\{\zeta_{N,i,k}^{1-q}\zeta_{N,j,k}^{1-q}\} \to I_q^2$, which completes the proof of the theorem.

6.3. Proof of Corollary 3.2. The existence of $H_1$ directly follows from that of $I_{q_1}$ for $q_1 < 1$ and the boundedness of f.

Asymptotic unbiasedness. We have

$$\mathrm{E}\hat H_{N,k,1} = \mathrm{E}\log\xi_{N,i,k} = \mathrm{E}[\mathrm{E}(\log\xi_{N,i,k}\mid X_i = x)],$$

where the only difference between the random variables $\xi_{N,i,k}$ (3.10) and $\zeta_{N,i,k}$ (3.2) is the substitution of $\exp[-\Psi(k)]$ for $C_k$. Similarly to the proof of Theorem 3.1, we define $F_{N,x,k}(u) = \Pr(\xi_{N,i,k} < u \mid X_i = x) = \Pr[\rho^{(i)}_{k,N-1} < R_N(u) \mid X_i = x]$ with now $R_N(u) = (u/\{(N-1)V_m\exp[-\Psi(k)]\})^{1/m}$. Following the same steps as in the proof of Theorem 3.1, we then obtain

$$F_{N,x,k}(u) \to F_{x,k}(u) = 1 - \exp(-\lambda u)\sum_{j=0}^{k-1}\frac{(\lambda u)^j}{j!}, \qquad N\to\infty,$$

for $\mu_L$-almost any x, with $\lambda = f(x)\exp[\Psi(k)]$.
Direct calculation gives $\int_0^\infty \log(u)F_{x,k}(du) = -\log f(x)$. We shall use again Theorem 2.5.1 of Bierens [5], page 34, and show that

$$J_N = \int_0^\infty |\log(u)|^{1+\delta}F_{N,x,k}(du) < \infty, \tag{6.18}$$

for some $\delta > 0$, which implies

$$\int_0^\infty \log(u)F_{N,x,k}(du) \to \int_0^\infty \log(u)F_{x,k}(du) = -\log f(x), \qquad N\to\infty,$$

for $\mu_L$-almost any x. The convergence

$$\int_{\mathbb{R}^m}\int_0^\infty \log(u)F_{N,x,k}(du)f(x)\,dx \to H_1, \qquad N\to\infty,$$

then follows from Lebesgue's bounded convergence theorem. In order to prove (6.18), we write

$$J_N = \int_0^1 |\log(u)|^{1+\delta}F_{N,x,k}(du) + \int_1^\infty |\log(u)|^{1+\delta}F_{N,x,k}(du). \tag{6.19}$$

Since f is bounded, we can take $q_2 > 1$ (and smaller than k + 1) such that $\int_0^\infty u^{1-q_2}F_{N,x,k}(du) < \infty$; see (6.7). Since $|\log(u)|^{1+\delta}/u^{1-q_2} \to 0$ when $u \to 0$, it implies that the first integral on the right-hand side of (6.19) is finite. Similarly, since, by assumption, $I_{q_1}$ exists for some $q_1 < 1$, $\int_0^\infty u^{1-q_1}F_{N,x,k}(du) < \infty$ and $|\log(u)|^{1+\delta}/u^{1-q_1} \to 0$, $u \to \infty$, implies that the second integral on the right-hand side of (6.19) is finite, which completes the proof that $\mathrm{E}\hat H_{N,k,1} \to H_1$ as $N \to \infty$.
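For concreteness, the Shannon entropy estimator analysed in this proof can be sketched as follows. The sketch assumes $\hat H_{N,k,1} = (1/N)\sum_i \log\xi_{N,i,k}$ with $\xi_{N,i,k} = (N-1)\exp[-\Psi(k)]V_m(\rho^{(i)}_{k,N-1})^m$, which is the form implied by the definition of $R_N(u)$ above; the digamma function is evaluated only at integer arguments, and the function names and brute-force search are illustrative choices.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(k):
    """Psi(k) = -gamma + sum_{j=1}^{k-1} 1/j for an integer k >= 1."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, k))


def shannon_entropy_estimate(points, k):
    """Nearest-neighbor estimate of H_1 = -int f log f from i.i.d. points (tuples in R^m)."""
    n, m = len(points), len(points[0])
    v_m = math.pi ** (m / 2) / math.gamma(m / 2 + 1)
    const = math.log(n - 1) - digamma_int(k) + math.log(v_m)
    total = 0.0
    for i, x in enumerate(points):
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        total += const + m * math.log(dists[k - 1])  # log xi_{N,i,k}
    return total / n


if __name__ == "__main__":
    random.seed(1)
    sample = [(random.gauss(0.0, 1.0),) for _ in range(400)]
    print(shannon_entropy_estimate(sample, k=3), 0.5 * math.log(2 * math.pi * math.e))
```

For the standard normal in dimension one the target value is $\frac{1}{2}\log(2\pi e) \approx 1.419$.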

L2 consistency. Similarly to the proof of asymptotic unbiasedness, we only need to replace $\zeta_{N,i,k}$ (3.2) by $\xi_{N,i,k}$ (3.10) and $C_k$ by $\exp[-\Psi(k)]$ in the proof of Theorem 3.2. When we now compute

$$\mathrm{E}(\hat H_{N,k,1} - H_1)^2 = \frac{\mathrm{E}(\log\xi_{N,i,k} - H_1)^2}{N} + \frac{1}{N^2}\sum_{i\ne j}\mathrm{E}\{(\log\xi_{N,i,k} - H_1)(\log\xi_{N,j,k} - H_1)\}, \tag{6.20}$$

in the first term, $\mathrm{E}(\log\xi_{N,i,k} - H_1)^2$ tends to

$$\int_{\mathbb{R}^m}\log^2 f(x)\,f(x)\,dx - H_1^2 + \dot\Psi(k) = \mathrm{var}[\log f(X)] + \dot\Psi(k),$$

where $\dot\Psi(z)$ is the trigamma function, $\dot\Psi(z) = d^2\log\Gamma(z)/dz^2$, and for the second term the developments are similar to those in Theorem 3.2. For instance, equation (6.13) now becomes $\int_0^\infty\int_0^\infty \log u\,\log v\,F_{N,x,y,k}(du,dv) \to \log f(x)\log f(y)$, $N \to \infty$, for $\mu_L$-almost x and y. We can then show that $\mathrm{E}\{\log\xi_{N,i,k}\log\xi_{N,j,k}\} \to H_1^2$ so that $\mathrm{E}(\hat H_{N,k,1} - H_1)^2 \to 0$, $N \to \infty$.